Query by Video: Cross-modal Music Retrieval

Abstract

Cross-modal retrieval learns the relationship between two types of data in a common space so that an input from one modality can retrieve data from another. We focus on modeling the relationship between two highly diverse modalities: music and real-world videos. We learn cross-modal embeddings using a two-stream network trained on music-video pairs. Each branch takes one modality as input and is constrained with emotion tags. These constraints allow the cross-modal embeddings to be learned with significantly fewer music-video pairs. To retrieve music for an input video, the trained model ranks tracks in the music database by their cross-modal distance to the query video. Quantitative evaluations show high accuracy for audio and video emotion tagging when each branch is evaluated independently, and high performance for cross-modal music retrieval. We also present cross-modal music retrieval experiments on Spotify music using user-generated videos from Instagram and YouTube as queries; subjective evaluations show that the proposed model can retrieve relevant music. We present the music retrieval results at: http://www.ece.rochester.edu/~bli23/projects/query.html.
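
The retrieval step described in the abstract, ranking tracks by their distance to a query video in the shared embedding space, can be illustrated with a minimal sketch. The abstract does not specify the distance metric or the embedding dimensionality, so the Euclidean distance on L2-normalized vectors, the function name `rank_tracks`, and the toy 3-dimensional embeddings below are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def rank_tracks(query_embedding, track_embeddings):
    """Rank tracks from nearest to farthest from a query video embedding.

    Assumption: both inputs already live in the shared cross-modal space,
    and Euclidean distance on L2-normalized vectors stands in for the
    "cross-modal distance" (the abstract does not name the metric).
    """
    # Normalize so that distance depends on direction, not magnitude.
    q = query_embedding / np.linalg.norm(query_embedding)
    t = track_embeddings / np.linalg.norm(track_embeddings, axis=1, keepdims=True)

    # One distance per track in the music database.
    distances = np.linalg.norm(t - q, axis=1)

    # Smallest distance first; a stable sort keeps ties in database order.
    order = np.argsort(distances, kind="stable")
    return order, distances


# Hypothetical toy data: one video query and a three-track music database.
video_query = np.array([1.0, 0.0, 0.0])
music_db = np.array([
    [0.0, 1.0, 0.0],   # track 0: orthogonal to the query
    [0.9, 0.1, 0.0],   # track 1: close to the query
    [-1.0, 0.0, 0.0],  # track 2: opposite to the query
])

order, distances = rank_tracks(video_query, music_db)
print(order.tolist())
```

In a real system the embeddings would come from the video and music branches of the trained two-stream network, and the top-ranked indices would be mapped back to track identifiers.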

Related

Cold-Starting Podcast Ads and Promotions with Multi-Task Learning on Spotify

Learning Optimal Personalised Reservation Prices in Impression Ad Auctions with Mixture Density Networks

LLARK: A Multimodal Instruction-Following Language Model for Music

Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge