Contrastive Learning-based Audio to Lyrics Alignment for Multiple Languages
Simon Durand, Daniel Stoller, Sebastian Ewert
This work describes an approach for modeling singing voice at scale by learning low-dimensional vocal embeddings from large collections of recorded music. We derive embeddings for different representations of the voice with genre labels. We evaluate on both objective (ranked retrieval) and subjective (perceptual evaluation) tasks. We conclude with a summary of our ongoing effort to crowdsource vocal style tags to refine our model.
Jakob Zeitler, Athanasios Vlontzos, Ciarán Mark Gilligan-Lee
Graham Van Goffrier, Lucas Maystre, Ciarán Mark Gilligan-Lee