Contrastive Learning-based Audio to Lyrics Alignment for Multiple Languages
Simon Durand, Daniel Stoller, Sebastian Ewert
Podcasts are conversational in nature, and speaker changes are frequent, which makes speaker diarization necessary for content understanding. We propose an unsupervised technique for speaker diarization that does not rely on language-specific components. The algorithm is overlap-aware and does not require the number of speakers to be known in advance. On podcast data, our approach shows a 79% improvement in purity scores (34% in F-score) over the Google Cloud Platform solution.
Simon Durand, Daniel Stoller, Sebastian Ewert
Gustavo Penha, Enrico Palumbo, Maryam Aziz, Alice Wang, and Hugues Bouchard
Jakob Zeitler, Athanasios Vlontzos, Ciarán Mark Gilligan-Lee