Topological Fingerprints for Audio Identification
Wojciech Reise, Ximena Fernández, Maria Dominguez, Heather A. Harrington, Mariano Beguerisse-Díaz
Cross-modal retrieval learns the relationship between two types of data in a common space so that an input from one modality can retrieve data from a different modality. We focus on modeling the relationship between two highly diverse types of data: music and real-world videos. We learn cross-modal embeddings using a two-stream network trained with music-video pairs. Each branch takes one modality as input and is constrained with emotion tags. These constraints allow the cross-modal embeddings to be learned with significantly fewer music-video pairs. To retrieve music for an input video, the trained model ranks tracks in the music database by their cross-modal distances to the query video. Quantitative evaluations show high accuracy for audio and video emotion tagging when each branch is evaluated independently, and high performance for cross-modal music retrieval. We also present cross-modal music retrieval experiments on Spotify music using user-generated videos from Instagram and YouTube as queries; subjective evaluations show that the proposed model can retrieve relevant music. We present the music retrieval results at: http://www.ece.rochester.edu/~bli23/projects/query.html.
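The retrieval step described in the abstract, ranking tracks by their distance to a query video in the shared embedding space, can be illustrated with a minimal sketch. The function name `rank_tracks`, the use of Euclidean distance, and the toy 2-D embeddings are all assumptions made for illustration; the abstract does not specify the distance measure or the embedding dimension, and no trained network is involved here.

```python
import numpy as np


def rank_tracks(query_embedding, track_embeddings):
    """Return track indices sorted from nearest to farthest from the query.

    Hypothetical helper: Euclidean distance is an assumption, since the
    abstract only says tracks are ranked by "cross-modal distances".
    """
    query = np.asarray(query_embedding, dtype=float)
    tracks = np.asarray(track_embeddings, dtype=float)
    # One distance per track (row) in the shared embedding space.
    distances = np.linalg.norm(tracks - query, axis=1)
    # Stable sort keeps the original order for tracks at equal distance.
    order = np.argsort(distances, kind="stable")
    return order, distances[order]


# Toy embeddings standing in for the outputs of the video and music branches.
video_query = [1.0, 0.0]
music_db = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]

order, dists = rank_tracks(video_query, music_db)
print(order.tolist())  # indices of tracks, nearest first
```

In this toy setup the second track lies closest to the query, so it is ranked first. A real system would replace the hand-written vectors with embeddings produced by the two trained branches.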
A. Ghazimatin, E. Garmash, G. Penha, K. Sheets, M. Achenbach, O. Semerci, R. Galvez, M. Tannenberg, S. Mantravadi, D. Narayanan, O. Kalaydzhyan, D. Cole, B. Carterette, A. Clifton, P. N. Bennett, C. Hauff, M. Lalmas-Roelleke
Amar Ashar, Karim Ginena, Maria Cipollone, Renata Barreto, Henriette Cramer