Generalized user representations for large-scale recommendations

Personalization lies at the heart of the Spotify platform, powering discovery, playlist generation, search, and home-page ranking for over 600 million monthly listeners worldwide. At this scale, delivering great recommendations depends on how well we capture users’ tastes, both their enduring preferences and the short-term shifts shaped by moods, moments, and new releases. As foundation models and generative recommendation experiences emerge, we need a unified user representation that is production-ready, fast to update, and transferable across tasks to serve as the backbone of personalization. In practice, this means building a representation that (1) generalizes across diverse products, (2) adapts quickly to changing behavior, (3) performs well in cold-start scenarios, and (4) scales efficiently across our systems.
To that end, we built a large-scale framework for generalized user representations at Spotify. Each listener is mapped to a high-dimensional embedding within a stable vector space that downstream systems can directly leverage, as raw features, for nearest-neighbor retrieval, or as conditioning for generative models. The framework operates in two stages: first, an autoencoder compresses multi-signal features into compact user embeddings; then, downstream products apply lightweight transfer learning on top of these embeddings. This design minimizes manual feature curation and allows teams to remain loosely coupled while sharing a common foundation. Deployed in production, the framework provides infrastructure that lets downstream models operate independently while staying aligned. Extensive online experiments show significant gains in consumption share, content discovery, and search success, alongside reductions in infrastructure costs.
Model overview
Our framework for generalized user representations follows two main stages:
Representation Learning (encoder–decoder): We train an autoencoder on rich, multi-modal user signals. The encoder produces a compact, semantically stable embedding, aka the User Representation, while the decoder reconstructs the inputs, ensuring the embedding preserves information useful across tasks.
Transfer Learning (downstream adaptation): Task-specific models for retrieval, ranking, search, and generation consume these embeddings and specialize via lightweight heads. This avoids redundant feature engineering, accelerates iteration, and keeps systems loosely coupled while sharing a common foundation.
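To make the two stages concrete, here is a minimal sketch of how an autoencoder-based user representation can feed a lightweight downstream head. The layer sizes, dimensions, and the DownstreamHead name are illustrative assumptions, not the production architecture.

```python
# Minimal sketch of the two-stage setup; dimensions and layers are illustrative assumptions.
import torch
import torch.nn as nn

FEATURE_DIM = 512      # fused multi-signal user features (assumed size)
EMBEDDING_DIM = 128    # compact user representation (assumed size)

class UserAutoencoder(nn.Module):
    """Stage 1: compress user features into a stable embedding and reconstruct them."""
    def __init__(self, feature_dim: int = FEATURE_DIM, embedding_dim: int = EMBEDDING_DIM):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, embedding_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)          # the generalized user representation
        return z, self.decoder(z)    # reconstruction drives the training loss

class DownstreamHead(nn.Module):
    """Stage 2: a lightweight task-specific head (e.g., a ranking score) on top of the embedding."""
    def __init__(self, embedding_dim: int = EMBEDDING_DIM):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(embedding_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, user_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(user_embedding)

# Usage: train the autoencoder on user features, then freeze the encoder and
# fine-tune only the head for each downstream task.
autoencoder = UserAutoencoder()
head = DownstreamHead()
features = torch.randn(32, FEATURE_DIM)          # a batch of fused user features
with torch.no_grad():                            # encoder stays fixed for transfer learning
    user_embedding, _ = autoencoder(features)
scores = head(user_embedding)                    # task-specific output
```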
Inputs & modality encoders
Capturing a listener’s taste at scale requires balancing signal richness with computational efficiency. To avoid exploding feature cardinality, we preprocess catalog interactions with two pretrained modality encoders and aggregate the resulting track embeddings across three time scales:
Audio encoder: learns track embeddings directly from audio features, capturing acoustic similarity.
Collaborative encoder: learns track embeddings from playlist co-occurrence, encoding collaborative signals and behavioral proximity.
Time scales: interactions are aggregated over ~6 months (core interests), 1 month (mid-term shifts), and 1 week (fresh intent).
These 80-dimensional track embeddings are fused with contextual, demographic, and onboarding signals, and then passed through the autoencoder. This design ensures compact yet expressive user vectors that are both efficient to train and fast to serve at scale.
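As an illustration of the fusion step, the sketch below pools per-window track embeddings and concatenates them with contextual signals. The mean-pooling choice, the function names, and the context dimensionality are assumptions; only the 80-dimensional track embeddings and the three time windows come from the description above.

```python
# Minimal sketch of pooling track embeddings per time window and fusing them with
# contextual features; not the production feature pipeline.
import numpy as np

TRACK_EMBEDDING_DIM = 80  # per the modality encoders described above

def pool_window(track_embeddings: np.ndarray) -> np.ndarray:
    """Average the track embeddings a user interacted with inside one time window."""
    if len(track_embeddings) == 0:
        return np.zeros(TRACK_EMBEDDING_DIM)   # inactive window falls back to zeros
    return track_embeddings.mean(axis=0)

def build_user_features(
    audio_6m, audio_1m, audio_1w,        # audio-encoder track embeddings per window
    collab_6m, collab_1m, collab_1w,     # collaborative-encoder embeddings per window
    context: np.ndarray,                  # contextual / demographic / onboarding features
) -> np.ndarray:
    pooled = [pool_window(w) for w in
              (audio_6m, audio_1m, audio_1w, collab_6m, collab_1m, collab_1w)]
    return np.concatenate(pooled + [context])  # input vector for the autoencoder

# Example: 200 tracks in the 6-month window, 40 in the last month, 5 in the last
# week, plus a 32-dimensional context vector (illustrative sizes).
rng = np.random.default_rng(0)
features = build_user_features(
    rng.normal(size=(200, 80)), rng.normal(size=(40, 80)), rng.normal(size=(5, 80)),
    rng.normal(size=(200, 80)), rng.normal(size=(40, 80)), rng.normal(size=(5, 80)),
    rng.normal(size=32),
)
print(features.shape)  # (6 * 80 + 32,) = (512,)
```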
Autoencoder
The autoencoder compresses the fused feature set (multi-modal, multi-timescale, plus context) into a stable user embedding. The decoder reconstructs the original inputs, forcing the representation to capture information useful for many downstream tasks. To improve robustness, we apply denoising during training, making embeddings resilient to sparsity and noise. This is crucial for handling cold-start scenarios and adapting to shifts in user behavior.
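A minimal sketch of the denoising idea follows, assuming a masking-style corruption of the input features; the corruption scheme and rate are illustrative, and the autoencoder argument can be any model that returns an embedding and a reconstruction (such as the sketch above).

```python
# One training step of a denoising autoencoder: corrupt the input by randomly masking
# dimensions and reconstruct the clean version. Corruption scheme and rate are assumptions.
import torch
import torch.nn.functional as F

def denoising_step(autoencoder, optimizer, features: torch.Tensor, mask_prob: float = 0.2):
    """Reconstruct clean user features from a corrupted copy."""
    keep_mask = (torch.rand_like(features) > mask_prob).float()
    corrupted = features * keep_mask                 # simulate sparse / missing signals
    _, reconstruction = autoencoder(corrupted)
    loss = F.mse_loss(reconstruction, features)      # target is the uncorrupted input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```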
The overall architecture of the model is shown in the Figure below.

Transfer learning & serving
Deploying user representations in production is not only a modeling problem but also a significant engineering challenge. To ensure correctness, we must keep all components (training, inference, and downstream adaptation) synchronized. If models update out of step, embeddings may lose semantic alignment, breaking the foundation on which transfer learning relies. To solve this, we developed a Batch Management system that coordinates updates and maintains consistency across the pipeline.
The framework addresses three key requirements:
Responsiveness
Fast inference alone does not guarantee responsiveness to evolving user preferences. If embeddings are based only on periodic batch jobs, they risk being stale, missing short-term changes like mood shifts, new interests, or reactions to fresh content. To address this, we combine batch inference with Near-Real-Time (NRT) updates:
Batch inference provides comprehensive coverage, ensuring every user has an up-to-date embedding, even if inactive.
NRT inference, triggered by user activity events (e.g., streaming a new artist), refreshes embeddings within minutes. These are processed, passed through the model, and stored in an online feature store, ensuring downstream systems can react almost immediately.
This hybrid strategy lets us balance freshness for active users with robustness and coverage for the full population.
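The sketch below illustrates how the batch and NRT paths could write to a shared online store. The FeatureStore interface, event fields, and function names are hypothetical stand-ins; the production system runs on Spotify's internal streaming and feature-store infrastructure.

```python
# Minimal sketch of the hybrid refresh strategy with a hypothetical online feature store.
import time
from dataclasses import dataclass

@dataclass
class UserActivityEvent:
    user_id: str
    timestamp: float

class FeatureStore:
    """Stand-in for an online feature store keyed by user ID."""
    def __init__(self):
        self._embeddings = {}

    def write(self, user_id: str, embedding, batch_id: str):
        self._embeddings[user_id] = {"embedding": embedding, "batch_id": batch_id,
                                     "updated_at": time.time()}

def nrt_update(event: UserActivityEvent, encode_user, store: FeatureStore, batch_id: str):
    """Near-real-time path: re-encode a user as soon as an activity event arrives."""
    embedding = encode_user(event.user_id)        # fetch fresh features + run the encoder
    store.write(event.user_id, embedding, batch_id)

def batch_refresh(all_user_ids, encode_user, store: FeatureStore, batch_id: str):
    """Batch path: periodic full pass that guarantees coverage, even for inactive users."""
    for user_id in all_user_ids:
        store.write(user_id, encode_user(user_id), batch_id)
```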
Cold-Start Awareness
New users present a unique challenge: without listening history, embeddings risk being empty or uninformative. To solve this, we leverage onboarding signals, such as selected artists, genres, or languages, and encode them using the same embedding pipeline as established users.
These signals, combined with demographic features, enable instant personalization from day one. As behavioral data accumulates, the system gradually shifts from onboarding-based features to behavior-driven signals, ensuring a smooth transition from cold-start to fully personalized experiences.
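One simple way to picture this transition is a blending weight that ramps from onboarding-derived features to behavioral ones as streams accumulate; the linear ramp and its length below are illustrative assumptions, not the production schedule.

```python
# Illustrative cold-start blend: down-weight onboarding features as listening history grows.
import numpy as np

def blend_cold_start(onboarding_features: np.ndarray,
                     behavioral_features: np.ndarray,
                     num_streams: int,
                     ramp_length: int = 50) -> np.ndarray:
    """Blend feature vectors with a weight that shifts toward behavior as data accrues."""
    w = min(num_streams / ramp_length, 1.0)       # 0.0 on day one, 1.0 once history is rich
    return (1.0 - w) * onboarding_features + w * behavioral_features

# Day-one user: output is driven entirely by onboarding picks (artists, genres, languages).
new_user = blend_cold_start(np.ones(80), np.zeros(80), num_streams=0)
# Established user: onboarding signals have fully faded out.
regular_user = blend_cold_start(np.ones(80), np.zeros(80), num_streams=500)
```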
Stability
For transfer learning to be effective, user embeddings must evolve within stable vector spaces that preserve semantic meaning over time. However, embeddings inevitably drift due to periodic retraining needed to combat model degradation. If left unsynchronized, this drift can disrupt downstream dependencies, e.g., search results may no longer align with recommendations.
Our solution is a coordinated Batch Management strategy:
Each retraining cycle generates a new set of embeddings, tagged with a unique batch ID.
Downstream models are retrained in sync with the new batch, ensuring semantic consistency.
During updates, production systems continue to serve from the previous (“legacy”) batch until the new one is fully integrated, preventing disruption.
This design guarantees that embeddings are always compared within the same semantic space, preserving alignment across models and delivering uninterrupted service across the full personalization stack.
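The sketch below captures the core of this idea: embeddings are published under a batch ID, and serving is promoted to a new batch only once every downstream consumer has been retrained against it. The registry class and method names are illustrative, not the production Batch Management system.

```python
# Minimal sketch of batch-managed serving: publish a new batch, promote it only when all
# downstream models are aligned, and keep serving the legacy batch until then.
class BatchRegistry:
    def __init__(self):
        self._batches = {}        # batch_id -> {user_id: embedding}
        self.active_batch = None  # batch currently served to downstream systems

    def publish(self, batch_id: str, embeddings: dict):
        """Register a freshly retrained batch without switching traffic to it yet."""
        self._batches[batch_id] = embeddings

    def promote(self, batch_id: str, retrained_consumers: set, required_consumers: set):
        """Switch serving only once every downstream model is retrained on the new batch."""
        if required_consumers - retrained_consumers:
            raise RuntimeError("Downstream models not yet retrained; keep serving legacy batch")
        self.active_batch = batch_id

    def lookup(self, user_id: str):
        """Serve from the active (possibly legacy) batch so comparisons stay in one space."""
        return self._batches[self.active_batch].get(user_id)
```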
Downstream results
The benefits of our framework became evident through extensive evaluation, both in offline prediction tasks and in large-scale online A/B tests. Together, these results demonstrate that generalized user representations improve accuracy, accelerate cold-start personalization, and scale across multiple downstream applications.
Predictive Accuracy
We observed significant improvements in future track streaming prediction:
Cold-start (within 4 hours of joining): +5% accuracy gain.
Longer-term (7-day prediction): +2% accuracy gain.
These results highlight the model’s ability to capture both immediate intent and more stable, enduring preferences.
Online Impact (A/B Tests)
When deployed in production, our approach delivered measurable gains across key personalization surfaces:
Candidate Generation (Home shelves): +2.9% increase in discoveries, +13% increase in item-to-stream (i2s) conversions.
Search Re-ranking: +0.06% overall improvement; +0.76% increase in podcast search success.
Home Ranking: +0.20% improvement in music discovery, +0.05% increase in consumption share.
These improvements validate that the learned representations are not only predictive but also translate into meaningful user engagement at scale.
Ablation Insights
To better understand the contribution of different components, we conducted a series of ablation studies. Results show that historical behavior, item semantics, and stable user attributes are complementary—and each materially improves representation quality:
Without onboarding signals: nDCG@50 on onboarding-aligned clusters drops by 13.8%, showing the value of early cold-start personalization.
Without modality encoders (audio + collaborative): AUC for 7-day prediction drops by 4.2%, and nDCG@50 on favorite-artist clusters drops by 37.1%, confirming that both acoustic and collaborative signals are critical.
Without static user features (e.g., registration country): nDCG@50 on country-aligned clusters drops by 12.1%, highlighting the importance of stable user attributes for anchoring personalization.
Conclusion
We developed a large-scale framework for generalized user representations at Spotify. By learning a single user vector from multi-modal, multi-timescale signals and serving it through a stable, synchronized pipeline, we improve retrieval, ranking, and search while also reducing infrastructure costs.
Our ablation studies confirm that onboarding cues, modality encoders, and stable user attributes are complementary and essential to representation quality. Online experiments further demonstrate meaningful gains in consumption share, content discovery, and search success.
Looking ahead, we are extending this representation to incorporate richer modalities, tighter real-time updates, and support for generative recommendation experiences.
For more information, please check our paper: Ghazal Fazelnia, Sanket Gupta, Claire Keum, Mark Koh, Timothy Heath, Guillermo Carrasco Hernández, Stephen Xie, Nandini Singh, Ian Anderson, Maya Hristakeva, Petter Pehrson Skidén, and Mounia Lalmas. "Generalized User Representations for Large-Scale Recommendations and Downstream Tasks." RecSys 2025 (Industry Track).