Large language models (LLMs) have recently driven significant improvements across many domains, including text-to-speech synthesis. The move from more traditional text-to-speech models to LLM-based models, however, is not straightforward for many applications, as these models are much larger than traditional ones and require Classifier-Free Guidance (CFG) for optimal quality. This potentially limits the applications for which LLM-based text-to-speech models are suitable. In this paper, we aim to address these issues by exploring the use of knowledge distillation for transformer-based text-to-speech models. Specifically, we investigate using knowledge distillation to train the student directly on the CFG-guided output of the teacher, removing the need for CFG at inference time. In addition, we explore using knowledge distillation to substantially reduce the required model size. Altogether, we were able to halve the model size, double inference speed, and remove the need for CFG without any perceptible drop in voice quality.
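
As a sketch of the kind of objective this entails (notation ours; the guidance scale $\gamma$, teacher logits, and student distribution $p_\theta$ are illustrative assumptions rather than the exact formulation used here), CFG combines the teacher's conditional and unconditional predictions at each step $t$, and distillation trains the student's conditional distribution to match the resulting guided distribution:
\[
\tilde{\ell}_t \;=\; \ell_t^{\text{uncond}} \;+\; \gamma\left(\ell_t^{\text{cond}} - \ell_t^{\text{uncond}}\right),
\qquad
\mathcal{L}_{\text{KD}} \;=\; \mathrm{KL}\!\left(\operatorname{softmax}(\tilde{\ell}_t)\;\middle\|\;p_\theta(\cdot \mid x, y_{<t})\right),
\]
so that at inference a single conditional forward pass of the student suffices, with no second unconditional pass and no guidance scale to tune.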