Scaling Transformer-based Text-to-Speech with Knowledge Distillation


Transformer-based models have led to dramatic improvements in text-to-speech (TTS) quality, enabling systems that can produce expressive, zero-shot speech with high fidelity. However, these gains often come at a significant computational cost. Models inspired by large language models (LLMs) tend to be large, slow to run, and dependent on inference-time tricks like classifier-free guidance (CFG) to achieve optimal output quality.

At Spotify, we set out to make these models more practical for real-world deployment, particularly in scenarios where latency, memory, or scalability are constraints. In this work, we explore how knowledge distillation can be used to streamline inference in LLM-style TTS models, while preserving (or even improving) output quality. Our findings are promising:

  • Removed the need for CFG at inference time

  • Achieved up to 2× faster inference

  • Reduced model size by over 50%

  • Maintained comparable or improved perceptual quality throughout

Power and practical limits of Transformer-based TTS

Recent transformer-based TTS systems such as VALL-E, AudioLM, and SpeechLM adopt LLM-style architectures that predict sequences of quantized audio tokens instead of continuous representations like mel-spectrograms. These models are capable of generating highly realistic speech across multiple speakers and prompts, including zero-shot cases where the voice is not seen during training. However, this quality comes with tradeoffs:

  • Large model sizes (hundreds of millions to billions of parameters)

  • Autoregressive inference (generating one token at a time)

  • Dependency on CFG, a technique that requires multiple forward passes through the model at inference time to improve conditioning fidelity

Classifier-free guidance (CFG) works by running multiple forward passes through the model, each conditioned on a different part of the input: for example, one conditioned on the full prompt (text + reference audio), one partially conditioned, and one unconditioned. The resulting predictions are then combined using a linear interpolation formula with tuned weights.


While effective, CFG roughly doubles inference time and increases memory usage, making these models difficult to scale or deploy in real-time applications.
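As a rough illustration, here is a minimal sketch of what CFG-style interpolation over next-token logits can look like. The model interface, the conditioning split, and the guidance weights below are assumptions for illustration, not the exact system described in this post.

```python
import torch

def cfg_logits(model, tokens, full_cond, partial_cond,
               w_full: float = 1.5, w_partial: float = 0.3) -> torch.Tensor:
    """Hypothetical CFG combination of next-token logits (illustrative only)."""
    # Each call below is a separate forward pass, which is what makes CFG
    # expensive at inference time.
    logits_full = model(tokens, cond=full_cond)        # text + reference audio
    logits_partial = model(tokens, cond=partial_cond)  # e.g. text only
    logits_uncond = model(tokens, cond=None)           # unconditioned

    # Linear interpolation/extrapolation with tuned weights: start from the
    # unconditioned prediction and push towards the conditioned ones.
    return (logits_uncond
            + w_full * (logits_full - logits_uncond)
            + w_partial * (logits_partial - logits_uncond))
```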

Our approach: Train once, run faster

We investigated whether knowledge distillation could mitigate the need for CFG and reduce model size, all without harming synthesis quality. In our setup:

  • The teacher model was a large, CFG-enabled transformer-based TTS model trained on hundreds of thousands of hours of speech.

  • The student model was a transformer of similar or smaller size, trained to mimic the teacher's output distributions using a KL divergence loss.

Instead of learning from ground truth tokens alone, the student learns to reproduce the CFG-enhanced token distributions generated by the teacher model. This allows it to learn the effects of CFG implicitly and produce outputs that match CFG-enhanced speech quality without requiring CFG at inference time.
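A minimal sketch of what such a distillation objective can look like, assuming the teacher's CFG-enhanced logits are available for each training sequence; the function names, tensor shapes, and temperature parameter are illustrative assumptions rather than the exact training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, tokens: torch.Tensor,
                      teacher_cfg_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over next-token distributions (illustrative sketch).

    `student(tokens)` is assumed to return logits of shape
    (batch, seq_len, vocab_size), matching `teacher_cfg_logits`.
    """
    student_logits = student(tokens)

    # The student is pushed to reproduce the teacher's CFG-enhanced distribution
    # over audio tokens, rather than learning from ground-truth tokens alone.
    teacher_probs = F.softmax(teacher_cfg_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```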

Removing CFG at inference time

The first key question we asked was: Can a student model trained this way match the quality of a teacher that uses CFG at inference time? To evaluate this, we compared:

  • The teacher with CFG

  • The same teacher without CFG

  • A student model of equal size, trained to mimic CFG-enhanced outputs

We ran both objective and subjective evaluations:

  • Objective metric: Word Error Rate (WER), computed by transcribing the synthesized speech with an automatic speech recognition model (a minimal WER sketch follows this list).

  • Subjective metrics: Stability and prosody, rated by human listeners on a 1–100 scale.
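For reference, WER is the word-level edit distance between the reference text and the ASR transcript of the synthesized audio, normalized by the reference length. A minimal, self-contained sketch of the metric itself (the actual ASR model used in the evaluation is not shown here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference and j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("the cat sat", "the cat sat down") == 1/3
```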

Our findings can be summarized as follows:

  • The teacher model performed significantly worse without CFG, confirming its importance.

  • The student model without CFG achieved better prosody and stability scores than the teacher with CFG.

  • The student also had lower WER, indicating more consistent and intelligible output.


Model           | WER (objective) | Stability (subjective) | Prosody (subjective)
----------------|-----------------|------------------------|---------------------
teacher w/ CFG  | 0.119 ± 0.056   | 85.83                  | 77.35
student w/o CFG | 0.059 ± 0.006   | 92.12                  | 83.96
teacher w/o CFG | 0.107 ± 0.031   | 28.99                  | 36.74

Table 1: Teacher (with and without CFG) vs. a same-size student without CFG.

These results show that knowledge distillation can eliminate the need for CFG at inference time, cutting runtime in half while preserving, or even improving, synthesis quality.

Investigating model behavior: Confidence and entropy

To understand why the student model outperforms its teacher in some metrics, we compared the token prediction distributions of both models. We found that:

  • The student model tends to assign more confident probabilities to the top predicted token (often >0.8).

  • The teacher model spreads probability across more tokens, possibly reflecting more exploratory behavior.

  • As a result, the entropy of the student’s predictions is lower, which correlates with improved stability in the output audio.


Figure 1: Comparison of top probability distribution between the teacher and large student on the same sentence.

These findings suggest that the distillation process may help the student model internalize and simplify the teacher’s decision boundaries, leading to more stable outputs.
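The statistics behind this comparison are straightforward to compute from each model's next-token logits. A minimal sketch, assuming logits of shape (num_steps, vocab_size); the variable names and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def confidence_stats(logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-step top-token probability and entropy of the predicted distribution."""
    probs = F.softmax(logits, dim=-1)
    top_prob = probs.max(dim=-1).values                                 # confidence in the argmax token
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)  # in nats
    return top_prob, entropy

# Hypothetical usage: compare the two models on the same sentence, e.g.
#   teacher_top, teacher_entropy = confidence_stats(teacher_logits)
#   student_top, student_entropy = confidence_stats(student_logits)
# A more confident student shows higher top_prob and lower entropy per step.
```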

Shrinking the model

Beyond eliminating CFG, we also explored how much we could reduce model size without sacrificing quality. We trained three student models:

  • Large student: same size as the teacher (~1.3B parameters)

  • Medium student: ~500M parameters (16 layers, narrower width)

  • Small student: ~180M parameters (12 layers, minimal width)

Our findings can be summarized as follows:

  • The medium student achieved quality on par with the teacher on both speaker similarity and naturalness.

  • The small student showed a noticeable drop in quality, especially in speaker fidelity and naturalness.

  • WER remained low for the large and medium students, but increased for the small student.


Model          | WER with 95% CI (objective) | Speaker similarity (subjective) | Overall naturalness (subjective)
---------------|-----------------------------|---------------------------------|---------------------------------
teacher        | 0.119 ± 0.056               | 59.66                           | 62.72
large student  | 0.059 ± 0.006               | 59.34                           | 62.88
medium student | 0.047 ± 0.003               | 58.72                           | 64.74
small student  | 0.28 ± 0.083                | 37.63                           | 19.41

Table 2: Results from objective and subjective evaluations comparing the effect of reducing the size of student models.

This shows that we can shrink the model by more than 60%, from 1.3B to 500M parameters, with no perceptible loss in quality. Smaller models are easier to deploy, faster to run, and more memory-efficient.

Faster, leaner inference

We benchmarked each model's inference speed on an NVIDIA A100 GPU by measuring how many audio tokens they could generate per second.

Model          | Parameters | CFG at inference | Inference speed (tokens/sec)
---------------|------------|------------------|-----------------------------
Teacher        | 1.3B       | Yes              | 63
Large Student  | 1.3B       | No               | 82
Medium Student | 500M       | No               | 121
Small Student  | 180M       | No               | 152

Some key takeaways include:

  • Removing CFG gives an immediate 1.3× speedup.

  • Reducing model size compounds the gain: the medium student is about 2× faster, and the small model is 2.4× faster than the teacher. Model memory usage also dropped from 14GB (teacher) to 6GB (medium) and 2GB (small).

These improvements make real-time and on-device synthesis much more feasible, without compromising quality.
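For context, the tokens-per-second numbers above come down to timing autoregressive generation. A rough sketch of such a measurement, assuming a hypothetical `model.generate_next` method that produces one audio token per call; this is not the actual benchmark code behind the table.

```python
import time
import torch

@torch.inference_mode()
def tokens_per_second(model, prompt_tokens: torch.Tensor, num_tokens: int = 500) -> float:
    """Time autoregressive generation and report throughput (illustrative sketch)."""
    tokens = prompt_tokens
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_tokens):
        next_token = model.generate_next(tokens)          # one token per step (assumed API)
        tokens = torch.cat([tokens, next_token], dim=-1)
    torch.cuda.synchronize()
    return num_tokens / (time.perf_counter() - start)
```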

Looking ahead

This work shows that knowledge distillation can do more than just compress transformer-based TTS models; it can reshape how we deploy them. By learning to approximate the output of classifier-free guidance, student models can generate high-quality speech with fewer resources, faster inference, and reduced system complexity. These improvements make advanced TTS models more usable in practical, real-world systems, especially where latency, memory, or compute capacity are limited.

As TTS systems become more deeply integrated into personalized, interactive, and always-on experiences, efficiency matters just as much as quality. This research highlights how thoughtful training strategies can bridge that gap, bringing us closer to speech synthesis systems that are both state-of-the-art and production-ready.

For more details, see our paper: "Knowledge Distillation for Transformer-based Text-to-Speech Models", Erik Henriksson, Thomas Merritt, Rasmus Dall, Felix Vaughan, and Veronica Morfi, Speech Synthesis Workshop (SSW13), 2025.