Joint Singing Voice Separation and F0 Estimation with Deep U-Net Architectures

Abstract

Vocal source separation and fundamental frequency (f0) estimation in music are tightly related tasks. The outputs of vocal source separation systems have previously been used as inputs to vocal f0 estimation systems; conversely, vocal f0 has been used as side information to improve vocal source separation. In this paper, we propose several approaches for jointly separating vocals and estimating f0. We show that joint learning is advantageous for these tasks, and that a stacked architecture which first performs vocal separation outperforms the other configurations considered. Furthermore, the best joint model achieves state-of-the-art results for vocal-f0 estimation on the iKala dataset. Finally, we highlight the importance of performing polyphonic, rather than monophonic, vocal-f0 estimation for many real-world cases.

Related

September 2022 | RecSys

Identifying New Podcasts with High General Appeal Using a Pure Exploration Infinitely-Armed Bandit Strategy

Maryam Aziz, Jesse Anderton, Kevin Jamieson, Alice Wang, Hugues Bouchard, Javed Aslam

September 2022 | Interspeech

Unsupervised Speaker Diarization that is Agnostic to Language, Overlap Aware, and Free of Tuning

M Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, Rosie Jones

September 2022 | Interspeech

Exploring audio-based stylistic variation in podcasts

Katariina Martikainen, Jussi Karlgren, Khiet Truong