Contrastive Learning-based Audio to Lyrics Alignment for Multiple Languages
Simon Durand, Daniel Stoller, Sebastian Ewert
Variational autoencoders are a versatile class of deep latent variable models. They learn expressive latent representations of high dimensional data. However, the latent variance is not a reliable estimate of how uncertain the model is about a given input point. We address this issue by introducing a sparse Gaussian process encoder. The Gaussian process leads to more reliable uncertainty estimates in the latent space. We investigate the implications of replacing the neural network encoder with a Gaussian process in light of recent research. We then demonstrate how the Gaussian Process encoder generates reliable uncertainty estimates while maintaining good likelihood estimates on a range of anomaly detection problems. Finally, we investigate the sensitivity to noise in the training data and show how an appropriate choice of Gaussian process kernel can lead to automatic relevance determination.
Simon Durand, Daniel Stoller, Sebastian Ewert
Jakob Zeitler, Athanasios Vlontzos, Ciarán Mark Gilligan-Lee
Graham Van Goffrier, Lucas Maystre, Ciarán Mark Gilligan-Lee