Mining Labeled Data from Web-Scale Collections for Vocal Activity Detection in Music

Abstract

This work demonstrates an approach to generating strongly labeled data for vocal activity detection by pairing instrumental versions of songs with their original mixes. Though such pairs are rare, a massive music collection yields ample instances for training deep convolutional networks on this task, achieving state-of-the-art performance with a fraction of the human effort required previously. Our error analysis reveals two notable insights: imperfect systems may exhibit better temporal precision than human annotators and should be used to accelerate annotation; and machine learning from mined data can reveal subtle biases in the data source, leading to a better understanding of the problem itself. We also discuss future directions for the design and evolution of benchmarking datasets to rigorously evaluate AI systems.
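The core labeling idea can be illustrated with a minimal sketch: given a sample-aligned instrumental/original pair, subtract the instrumental from the mix and threshold the frame-wise energy of the residual to obtain frame-level vocal/non-vocal labels. The function and parameter names below are illustrative assumptions, not taken from the paper, and the sketch assumes the two recordings are time-aligned and level-matched, which real pairs generally are not without preprocessing.

```python
import numpy as np

def frame_rms(x, frame_len=2048, hop=512):
    """Frame-wise RMS energy of a 1-D signal."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt((frames ** 2).mean(axis=1))

def vocal_activity_labels(mix, instrumental, threshold_db=-40.0,
                          frame_len=2048, hop=512):
    """Label frames as vocal (1) or non-vocal (0) via the mix/instrumental residual.

    Illustrative sketch only: assumes the signals are sample-aligned and
    equal-loudness, so the residual approximates the isolated vocal.
    """
    n = min(len(mix), len(instrumental))
    residual = mix[:n] - instrumental[:n]
    rms = frame_rms(residual, frame_len, hop)
    ref = frame_rms(mix[:n], frame_len, hop).max() + 1e-12
    db = 20.0 * np.log10(rms / ref + 1e-12)  # residual level relative to mix peak
    return (db > threshold_db).astype(int)
```

In practice, labels derived this way would still need cleanup (alignment, loudness matching, smoothing of short spurious segments) before serving as strong supervision for a convolutional network.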

Related

September 2022 | Interspeech

Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free

M Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, Rosie Jones

September 2022 | Interspeech

Exploring audio-based stylistic variation in podcasts

Katariina Martikainen, Jussi Karlgren, Khiet Truong

July 2022 | SIGIR

What Makes a Good Podcast Summary?

Rezvaneh Rezapour, Sravana Reddy, Ann Clifton, Rosie Jones