Contrastive Learning-based Audio to Lyrics Alignment for Multiple Languages
Simon Durand, Daniel Stoller, Sebastian Ewert
Cross-modal retrieval learns the relationship between two types of data in a common space so that an input from one modality can retrieve data from a different modality. We focus on modeling the relationship between two highly diverse types of data: music and real-world videos. We learn cross-modal embeddings using a two-stream network trained with music-video pairs. Each branch takes one modality as input and is constrained with emotion tags. These constraints allow the cross-modal embeddings to be learned with significantly fewer music-video pairs. To retrieve music for an input video, the trained model ranks tracks in the music database by their cross-modal distances to the query video. Quantitative evaluations show high accuracy in audio/video emotion tagging when each branch is evaluated independently, as well as high performance in cross-modal music retrieval. We also present cross-modal music retrieval experiments on Spotify music using user-generated videos from Instagram and YouTube as queries, and subjective evaluations show that the proposed model can retrieve relevant music. We present the music retrieval results at: http://www.ece.rochester.edu/~bli23/projects/query.html.
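The abstract describes a two-stream network that maps each modality into a shared embedding space, with an emotion-tag classifier constraining each branch, and retrieval performed by ranking tracks by cross-modal distance. A minimal sketch of that setup is given below. All module names, feature dimensions, and the choice of cosine distance are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a two-stream cross-modal embedding model.
# Assumptions (not from the paper): pre-computed music/video feature
# vectors, the layer sizes, and cosine distance for ranking.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamEmbedding(nn.Module):
    """Maps music and video features into a shared embedding space.

    A shared emotion-tag head is applied to both embeddings, serving as
    the auxiliary constraint the abstract mentions.
    """

    def __init__(self, music_dim=128, video_dim=512, embed_dim=64, num_tags=8):
        super().__init__()
        self.music_branch = nn.Sequential(
            nn.Linear(music_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )
        self.video_branch = nn.Sequential(
            nn.Linear(video_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )
        # Emotion-tag classifier on top of the shared embedding space.
        self.tag_head = nn.Linear(embed_dim, num_tags)

    def forward(self, music_feats, video_feats):
        m = F.normalize(self.music_branch(music_feats), dim=-1)
        v = F.normalize(self.video_branch(video_feats), dim=-1)
        return m, v, self.tag_head(m), self.tag_head(v)


def retrieve(model, query_video_feats, music_db_feats, top_k=5):
    """Rank database tracks by cross-modal distance to the query video."""
    with torch.no_grad():
        v = F.normalize(model.video_branch(query_video_feats), dim=-1)  # (1, D)
        m = F.normalize(model.music_branch(music_db_feats), dim=-1)     # (N, D)
        # Cosine distance = 1 - cosine similarity on unit-norm vectors.
        dists = 1.0 - (m @ v.T).squeeze(1)                              # (N,)
        return torch.argsort(dists)[:top_k]


# Usage with random stand-in features for one query video and 1000 tracks.
model = TwoStreamEmbedding()
query = torch.randn(1, 512)
database = torch.randn(1000, 128)
print(retrieve(model, query, database))
```

At training time, the embedding branches would be optimized jointly on music-video pairs while the tag head is supervised with emotion labels; the abstract notes that this auxiliary constraint is what allows the embeddings to be learned from significantly fewer pairs.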