Teaching Large Language Models to Speak Spotify: How Semantic IDs Enable Personalization

Spotify’s mission is to connect every listener with the right audio and help every creator be discovered. Large language models (LLMs) present a powerful opportunity to bridge what our content expresses with what our listeners seek. By learning the shared language of our catalog and our users, LLMs can enable Spotify to personalize discovery in ways that bring creators and audiences closer together.
Now imagine a Spotify that not only recommends the right content but can also explain why a recommendation is interesting, grounding that explanation in the listener’s own preferences. This vision drives our exploration of LLMs as tools to model user intent, represent content at a deeper semantic level, and deliver recommendations that foster more meaningful connection between creators and listeners.
While existing LLMs possess broad world knowledge and strong reasoning skills, they are trained on general text rather than Spotify’s unique domain. They lack familiarity with our catalog’s structure, the intricate relationships among creators and content, the dynamics of user preferences, and the short-term shifts typical of listening behavior.
To address these gaps, we anchor our exploration of LLMs in Spotify’s core challenge: understanding how each listener’s context — their history, mood, and moment — connects to the right piece of audio. Achieving this requires rich representations of both the user’s preferences and our vast catalog of content. A naïve approach would be to describe each entity — artist, album, or podcast — solely through text, such as names, genres, and other metadata like popularity. However, textual descriptions are often inefficient and ambiguous: they can be noisy, inconsistent, and lack the fine-grained relational structure that connects items in Spotify’s ecosystem.
To help an LLM speak Spotify, we represent our catalog and listener behaviors through Semantic IDs [1,2]: compact, catalog-native identifiers that encode relationships between content and users. By domain-adapting an open-weight LLM into a purpose-built recommendation model with these Semantic ID tokens, we enable it to reason about Spotify’s catalog and listener behaviors much as it reasons about words in text, learning the relationships, patterns, and semantics that connect them. This enables our model to better capture user preferences, generate explainable recommendations, and connect listeners and creators at scale through richer, more meaningful personalization.
A Spotify catalog-native vocabulary for LLMs
To personalize effectively, language models need more than to memorize Spotify’s catalog; they need a vocabulary that lets them describe and reason about it. Each entity must be broken down into its defining dimensions: the melody, mood, and energy of a track; the topic, tone, and conversational style of a podcast; or the creative signature linking an artist’s work. These dimensions form the alphabet of Spotify’s world – tokens that represent meaning rather than words.
Consider a listener who plays an audiobook on the evolution of AI from statistics to modern machine learning. To represent this entity meaningfully, a model must capture not only its subject matter but also its reflective tone and narrative style. When that same listener later engages with a podcast on LLMs applied in healthcare or the labor market, the model should recognize the shared semantic threads: a curiosity about technology, LLMs, and societal impact. Capturing such nuances requires fine-grained, structured representations of podcasts and audiobooks that reveal connections across a user’s listening history.
Building such representations is uniquely difficult at Spotify’s scale. Our catalog spans multi-modal content, including music, podcasts, and audiobooks, each with its own structure, duration, and expressive form. Music conveys melody and rhythm; speech carries narrative and tone; podcasts range from 60-second horoscopes to two-hour interviews. This diversity makes it hard to define a unified vocabulary that LLMs can use while preserving the semantic richness of each format.
The challenge deepens when language and catalog entities must coexist within the same model, enabling it to produce both human-readable explanations and personalized recommendations. Designing representations that bridge these modalities — expressive enough for reasoning yet compact enough for modeling — lies at the core of teaching LLMs to understand Spotify’s world.
Domain-adapting an open-weight LLM for recommendations with a catalog-native vocabulary involves three key stages:
Building representations for Spotify’s world: We follow recent work and use compact, discrete Semantic ID tokens that encode semantic similarity among items. Semantic IDs provide crucial advantages such as improved generalization (similar items share tokens) [8] and greater system efficiency (discrete tokens reduce memory and bandwidth costs).
Initialization and alignment: Similarly to PLUM [9], we teach the LLM how these tokens relate to natural language, allowing it to move fluidly between text and catalog semantics.
Domain-specific training: We fine-tune the model on personalization and reasoning tasks grounded in real Spotify data, allowing it to connect language, entities, and user behavior into coherent, explainable recommendations.
Building Semantic IDs for Spotify’s world
We construct Semantic IDs for tens of millions of Spotify entities (artists, podcasts, episodes, and audiobooks) by combining two complementary types of signals that augment, rather than replace, our existing features:
Textual signals (titles, descriptions, transcripts) capture what an item is.
Behavioral signals (co-listening patterns, transitions) capture how listeners engage with it.
Each entity is first represented through embeddings learned from these signals: text encoders capture semantic meaning from language, while behavioral models capture patterns of co-consumption and user flow. We then apply an internally-developed residual Lookup-Free Quantization (LFQ) model [3] to map these continuous embeddings into discrete Semantic IDs.
The residual scheme mirrors how LLMs generate language, making it a natural fit for autoregressive modeling. The process begins with a coarse code that identifies a broad region of the embedding space (much like choosing a general topic in a conversation). Then, residual codes refine that representation step by step, adding increasingly specific details. This hierarchical refinement can be expressed as a short sequence of tokens, which aligns well with how LLMs predict tokens sequentially.
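The coarse-to-fine refinement described above can be sketched in a few lines. This is an illustrative nearest-neighbor residual quantizer, not Spotify’s internal residual LFQ model: at each level, the code that best matches the current residual is chosen, and its codebook entry is subtracted before the next level refines the remainder.

```python
def quantize_residual(embedding, codebooks):
    """Map a continuous embedding to a short sequence of discrete codes.

    Each level picks the codebook entry closest to the current residual,
    then subtracts it, so later codes refine earlier ones (coarse to fine).
    Illustrative nearest-neighbor scheme, not Spotify's internal LFQ.
    """
    codes = []
    residual = list(embedding)
    for codebook in codebooks:
        # Pick the entry minimizing squared distance to the residual.
        best = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        codes.append(best)
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return codes

def reconstruct(codes, codebooks):
    """Sum the chosen codebook entries to approximate the original embedding."""
    vec = [0.0] * len(codebooks[0][0])
    for code, codebook in zip(codes, codebooks):
        vec = [v + c for v, c in zip(vec, codebook[code])]
    return vec
```

The resulting code sequence (e.g. `[3, 17, 5]`) is what gets rendered as a short run of Semantic ID tokens for the LLM to consume and produce.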
To represent different types of entities effectively, we train lightweight, type-specific quantizers, which are small residual codebooks tuned to the characteristics of each embedding space (e.g., music, podcast embeddings). These specialized quantizers share a common backbone, ensuring that representations across content types remain compatible and can be reasoned about within the same model.
To reflect the diversity of Spotify’s catalog, we currently construct separate representation spaces for different content modalities. Text-rich entities such as podcast episodes, shows, and audiobooks are modeled within a shared text embedding space, while music items, whose semantic cues come primarily from audio or curation behavior, are represented in a distinct space, each with its own corresponding Semantic ID space.
This design raises open questions about how best to integrate heterogeneous content types. Exploring ways to jointly model text- and audio-driven entities remains a key direction for future work.
Initialization and Alignment of Semantic IDs
Once Semantic IDs are created, the next step is to integrate them into an LLM. Similarly to PLUM [9], we begin by expanding the tokenizer to include the new IDs, effectively extending the model’s vocabulary. This expansion also requires enlarging the token embedding matrix to accommodate the added tokens.
We explored two strategies for initializing these new embeddings: random initialization and mean-based initialization, where each new embedding is set to the average of existing ones. Based on ablation studies, we found that random initialization performed best, offering a clean starting point that avoids biasing the new tokens toward any particular linguistic distribution.
The next challenge is aligning the newly added tokens with the LLM’s existing language space so that the model can reason jointly over text and catalog entities. To achieve this, we employ a partial weight-freezing strategy: the core LLM and its original embeddings remain frozen, while only the new token embeddings are updated. Training is performed on mixed text-Semantic ID sequences, enabling the model to learn how the new tokens relate to its language knowledge while maintaining the stability of the pretrained weights.
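A toy sketch of the two mechanics above: extending the embedding table with randomly initialized rows for the new Semantic ID tokens, and applying gradient updates only to those rows while the original text-token rows stay frozen. Real training would use a deep-learning framework; the function names and the plain-list representation here are illustrative only.

```python
import random

def extend_vocabulary(embeddings, num_new_tokens, dim, seed=0):
    """Append randomly initialized rows for new Semantic ID tokens.

    Random initialization outperformed mean-based initialization in our
    ablations; this toy version just appends small random vectors.
    """
    rng = random.Random(seed)
    new_rows = [[rng.uniform(-0.02, 0.02) for _ in range(dim)]
                for _ in range(num_new_tokens)]
    return embeddings + new_rows

def masked_sgd_step(embeddings, grads, num_frozen, lr=0.1):
    """Apply a gradient step only to rows at index >= num_frozen.

    Rows 0..num_frozen-1 (the original text-token embeddings) stay fixed,
    mimicking the partial weight-freezing strategy.
    """
    for row in range(num_frozen, len(embeddings)):
        embeddings[row] = [w - lr * g for w, g in zip(embeddings[row], grads[row])]
    return embeddings
```

In a framework like PyTorch the same effect is typically achieved by zeroing (or masking) the gradient of the frozen embedding rows before the optimizer step.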
Through this process, the model learns to associate natural phrases such as melancholic piano, narrative journalism, or comedy show with specific, grounded entities in Spotify’s domain, effectively teaching the LLM to “speak” both human language and Spotify’s catalog language.
Domain Specific Training: Learning personalization tasks
Once the LLM is aligned, we fine-tune it on a diverse set of personalization and reasoning tasks to enable the LLM to generate personalized recommendations and explanations. These tasks include semantic search, episode recommendation, and contextual reasoning, where inputs and outputs interleave natural language and Semantic IDs.
Each prompt includes user context — such as country and recent listening history, represented by Semantic IDs — allowing the model to reason about a listener’s preferences, transitions, and discovery patterns. To prevent catastrophic forgetting (a phenomenon where fine-tuning on new tasks causes a model to lose previously learned general language capabilities), we mix a small percentage of text-only instruction-tuning data into the training recipe.
We also include synthetic data, which is data automatically generated to simulate realistic user interactions and reasoning tasks involving Semantic IDs. For instance, we create synthetic queries in natural language (“Recommend something like my recent sci-fi podcasts”) paired with corresponding Semantic ID sequences that represent relevant items. We also generate reasoning-style tasks, such as explaining a user’s interests based on their listening history. This synthetic data serves two purposes: it teaches the model how to use Semantic IDs in natural dialogue contexts, and it augments the real training data, reducing the need for costly manual annotations. This helps the model connect catalog-specific representations with natural language reasoning, bridging the gap between Spotify’s world and general LLM understanding.
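To make the shape of these examples concrete, here is a minimal sketch of how a mixed text/Semantic-ID training pair might be assembled. The `<sid_level_code>` token format and the prompt template are hypothetical stand-ins, not Spotify’s actual scheme.

```python
def sid_tokens(codes, prefix="sid"):
    """Render a Semantic ID code sequence as special tokens, e.g. <sid_0_12>.

    The token format here is illustrative, not Spotify's actual scheme.
    """
    return "".join(f"<{prefix}_{level}_{code}>" for level, code in enumerate(codes))

def build_training_example(country, history, query, target):
    """Interleave natural language and Semantic IDs into one prompt/target pair.

    `history` and `target` are Semantic ID code sequences; the prompt mixes
    user context, history tokens, and a natural-language request.
    """
    history_str = " ".join(sid_tokens(codes) for codes in history)
    prompt = (
        f"User country: {country}\n"
        f"Recent listening: {history_str}\n"
        f"Request: {query}\n"
        f"Recommendation:"
    )
    return prompt, sid_tokens(target)
```

During fine-tuning, such pairs sit alongside reasoning-style tasks (targets in natural language) and a small fraction of text-only instruction data to guard against catastrophic forgetting.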
Through this stage, the model learns to apply its semantic understanding to real personalization tasks, not only retrieving relevant content but also explaining why it fits a listener’s intent and context.
After these stages, our adapted open-weight LLM becomes catalog-native: it can generate both words and catalog entities via Semantic IDs. In essence, the LLM learns to speak Spotify: understanding and expressing the relationships between users, content, and context in the platform’s own semantic language. This unlocks deeper personalization capabilities in a pre-trained LLM (e.g., picking the right running songs) and enables finer control over outputs (e.g., running songs for me with tuned track diversity).
Evaluating LLMs that speak Spotify
We evaluated our domain-adapted LLM across multiple personalization tasks, comparing it against strong production baselines to assess whether teaching LLMs to “speak” Spotify through Semantic IDs makes personalization more adaptive, intuitive, and explainable.
We adapted a 1B-parameter LLM, as it strikes an effective balance between capability and efficiency. This scale is large enough to capture complex semantic relationships while remaining lightweight for rapid experimentation and real-time inference.
The model was benchmarked across several key personalization tasks:
| Task | Description |
| --- | --- |
| Episode recommendations | Predict the next podcast episode a listener will stream, both from known shows (familiarity) and new ones (discovery). |
| Search | Retrieve relevant content (music or podcasts) from a natural language query. |
| Playlist generation | Generate lists of artists based on descriptive prompts. |
| User understanding | Explain a recommendation using listening history, or summarize a listener’s interests in natural language. |
Across these tasks, our domain-tuned model consistently matched or outperformed existing production systems, demonstrating that teaching LLMs to reason in Spotify’s own semantic language, by representing the catalog through Semantic IDs and adapting to listener behavior, enables them to generalize more effectively across personalization contexts.
In episode recommendation, the model achieved up to a 1.96× improvement over baseline models, showing stronger alignment with listener intent. In semantic search, it delivered comparable performance to our production model, with notable gains in broad-intent queries where traditional keyword-based or embedding-only approaches often struggle.
Interestingly, multi-task training (e.g., fine-tuning on both recommendation and search objectives) yielded an additional 22% improvement compared to single-task setups. This finding aligns with recent work on bridging search and recommendation in generative retrieval [4]. This suggests that the model learns a shared latent structure between users and entities. In effect, it develops a transferable understanding that generalizes across personalization tasks.
Another key insight was that the quality of input text directly influences the quality of Semantic ID and overall model accuracy. For example, cleaning and refining the podcast episode descriptions with an LLM improved episode-recommendation accuracy by up to 5.4%. When text inputs are clearer, the underlying embeddings become more semantically precise. This, in turn, improves quantization (the process that maps continuous embeddings into discrete Semantic IDs) by reducing ambiguity and overlap between similar items. Better quantization produces more distinct and meaningful Semantic IDs, leading to stronger personalization performance.
Beyond quantitative gains, one of the most interesting outcomes is how the model reasons. When prompted with a query such as “a funny podcast that explores moral theories,” the LLM does not simply retrieve popular results. Instead, it infers from the listener’s history that they enjoy lighthearted yet reflective storytelling. The resulting recommendations include episodes that blend humor with ethical discussion, each accompanied by a concise natural-language explanation connecting tone, theme, and host style.
We see a similar interpretability emerging in artist recommendations. For a listener who enjoys classical music, the model links orchestras not merely by name but through shared repertoire, recording era, and stylistic lineage, tracing subtle chains of influence that mirror real-world cultural and musical relationships.
Scaling
To measure how capacity and data shape personalization quality, we scaled along two axes: model size and fine-tuning data volume.
We observed that larger model sizes and more examples during instruction fine-tuning consistently improved performance, particularly on the search task, with gains of up to 16% when scaling from 0.5B to 8B parameters. These results demonstrate the clear benefits of scaling LLMs fine-tuned with Semantic IDs.
However, scaling introduces important trade-offs. Larger LLMs improve semantic reasoning but also increase latency and serving costs. To balance quality with efficiency, we prioritized smaller base models that meet real-time latency budgets while maintaining strong personalization performance. This approach ensures predictable serving behaviour and product quality consistency as we scale.
During online inference, the trained model reasons directly in the Semantic ID space. Spotify’s catalog identifiers are translated into Semantic IDs on the fly, for both inputs and outputs. To make this process efficient, we built a lightweight, Redis-backed key-value store [5] that rapidly resolves Spotify URIs to their corresponding Semantic IDs, and vice versa.
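The translation layer amounts to a bidirectional key-value lookup. The sketch below uses an in-memory dict standing in for the Redis-backed store; the class and method names are hypothetical, but the interface (resolving URIs to Semantic IDs and back) reflects the idea described above.

```python
class SemanticIdResolver:
    """Bidirectional Spotify URI <-> Semantic ID lookup.

    A plain dict stands in for the production Redis-backed key-value
    store; Semantic IDs are stored as tuples so they can serve as keys.
    """

    def __init__(self):
        self._uri_to_sid = {}
        self._sid_to_uri = {}

    def put(self, uri, sid):
        # Register both directions so inputs and outputs can be translated.
        self._uri_to_sid[uri] = sid
        self._sid_to_uri[sid] = uri

    def to_sid(self, uri):
        return self._uri_to_sid.get(uri)

    def to_uri(self, sid):
        return self._sid_to_uri.get(sid)
```

In production, the same two lookups would map to Redis `GET`/`SET` calls on two keyspaces, keeping the hot path to a couple of O(1) reads per entity.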
Serving
Having established how model scaling affects personalization quality, we next focus on how these models are served efficiently at scale. Our goal is to preserve the semantic precision achieved during training while meeting Spotify’s real-time latency requirements.
We serve the trained model using vLLM [6], which provides high-throughput inference for LLMs. At inference time, we use beam search to predict Semantic IDs. Beam search achieved the best offline accuracy in our evaluations, ensuring that predicted IDs correspond to valid catalog entities. The trade-off, however, is latency: beam search is slower than sampling-based decoding but produces more accurate and valid Semantic ID predictions. To reduce the latency overhead, we are currently exploring fast constrained decoding strategies (e.g., Bloom-filter-based validity checks [7]) that preserve beam search-level accuracy while approaching the speed of sampling.
Conclusion
Our goal was to teach an LLM to speak Spotify to understand the language of our catalog, our creators and our listeners. To achieve this, we domain-adapted an open-weight LLM by grounding it with Semantic IDs, giving it a way to represent Spotify’s world directly rather than through general text alone. This grounding does more than improve accuracy: it gives the model a catalog-native vocabulary to reason with, turning it into a system that can express the relationships between content, behavior, and context in Spotify’s own semantic language. The result is personalization that is not only more relevant, but also explainable, adaptive, and real-time at production scale.
This work lays the foundation for a single recommender system purposely built to connect understanding and action, interpret user intent, represent content deeply, and translate that understanding into meaningful experiences for both listeners and creators.
Looking ahead, our focus expands toward reasoning, agency, and scale:
User agency: Make personalization transparent and steerable, with natural-language controls and clear “why this” explanations built into the experience.
Continued scaling: Extend to richer multimodal understanding (text, audio, beyond) while optimizing for latency, throughput and cost through model distillation and efficient serving.
Cross-task reasoning: Leveraging signals from one domain (e.g. podcasts) to improve discovery in others (e.g. music), enabling the model to reason holistically across surfaces.
Together, these directions advance our broader vision: a model that does not just recommend, but truly speaks Spotify, enabling adaptive, transparent, and deeply personalized listening for every user. Stay tuned for our next post, where we will share how this work is powering generative recommendations for podcast listeners on Spotify.
Acknowledgments
This was a cross-Spotify effort. We’re grateful to our partners in ML Platform, Tech Research, and AI Foundation, whose collaboration took this from prototype to production - and who continue to push it forward with us. Special thanks to Martin Bomio, Joseph Cauteruccio, Keshi Dai, Juan Elenter Litwin, Martin Gould, Binal Jhaveri, Shireen Khan, Eliza Klyce, Max Lefarov, Wei-Hsiang Lin, Yves Raimond, Matthew Smith, Jan Stypka, Alexandre Tamborrino, Mark VanMiddlesworth, Todd Wasson, and Jacqueline Wood. We also appreciate the many engineers, researchers, data scientists, and PMs across these groups whose feedback shaped the modeling, serving, and evaluation behind this work. We are also thankful to Tony Jebara, whose early guidance and support helped set the direction for this work.
References
[1] Rajput, Shashank, et al. "Recommender Systems with Generative Retrieval." NeurIPS 2023.
[2] Singh, Anima, et al. "Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations." RecSys 2024.
[3] Yu, Lijun, et al. "Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation." ICLR 2024.
[4] Penha, Gustavo, et al. "Bridging Search and Recommendation in Generative Retrieval: Does One Task Help the Other?" RecSys 2024.
[5] Redis, https://redis.io/
[6] vLLM, https://github.com/vllm-project/vllm
[7] Bloom filter, https://en.wikipedia.org/wiki/Bloom_filter
[8] Singh, Anima, et al. "Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations." RecSys 2024.
[9] He, Ruining, et al. "PLUM: Adapting Pre-trained Language Models for Industrial-scale Generative Recommendations." arXiv:2510.07784 (2025).



