Personalizing Agentic AI to Users' Musical Tastes with Scalable Preference Optimization

When users ask for “music for a solo night drive through the city,” recommender systems face a fundamental challenge: popular defaults often fail to capture the nuance of situational queries. At Spotify, our goal is to build systems that learn listening preferences directly from every play, skip, save, and refinement, treating each interaction as preference feedback. This loop is especially important for subtle requests: if a user skips a high-energy pop track but listens through a moody electronic one, that behavior should become a clear signal of the desired vibe. The key question, then, is: how do we build a system that creates great playlists from user requests and continuously learns from feedback to deliver better recommendations over time? In this post, we present our approach, which uses LLM-based agentic systems that interpret queries, orchestrate tools, and adapt through preference learning, going beyond traditional recommender systems to enable more nuanced, intent-driven playlist generation.
Limitations of traditional approaches
Traditional recommender systems, often trained on fixed historical datasets, struggle to adapt to evolving user preferences or a constantly changing catalog. Periodic retraining can help, but it is slow and coarse. Many follow a pipeline approach, with multiple sequential components, making it difficult to assign credit or blame when recommendations succeed or fail.
Reinforcement learning (RL) offers a more adaptive alternative by learning directly from interaction data. However, RL systems can regress when the underlying base model is updated, since learned preferences do not always transfer cleanly. Adding to this challenge, the catalog of items is itself constantly evolving, so the system must continually adapt to shifting preferences and choices.
A hybrid approach: Reward Models + Direct Preference Optimization
To address these challenges, we developed a hybrid approach that combines two complementary methods. First, we use a reward model, drawing inspiration from Reinforcement Learning from Human Feedback (RLHF), to estimate user satisfaction for potential recommendations. This model creates preference pairs by simulating user choices. Second, these pairs are used to fine-tune the system with Direct Preference Optimization (DPO).
At the core is an LLM-based agentic system that interprets user queries, generates orchestration plans in a domain-specific language (DSL), calls tools to search and filter music, and synthesizes the results into playlists. Spotify's AI Playlist feature is one example that leverages this type of approach [1]. Importantly, users do not interact with orchestration plans (how the playlist came about) but with the playlist itself (by playing tracks). This creates a credit-assignment problem: preferences are expressed over final outcomes rather than the orchestration plan that produced them. Moreover, there is no single “correct” playlist for a given query.
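To make this concrete, here is a hypothetical sketch of what an orchestration plan for the night-drive query might look like. Spotify’s actual DSL, tool names, and parameters are not published, so every identifier below (search_tracks, filter_by_audio_features, and so on) is illustrative only.

```python
# Hypothetical orchestration plan for "music for a solo night drive through the city".
# The real DSL and tool names are not published; these are illustrative placeholders.
orchestration_plan = [
    {"tool": "search_tracks",            "args": {"query": "moody late-night electronic", "limit": 200}},
    {"tool": "filter_by_audio_features", "args": {"energy": (0.2, 0.6), "tempo_bpm": (80, 115)}},
    {"tool": "rank_by_user_affinity",    "args": {"context": "solo night drive", "diversity": 0.3}},
    {"tool": "assemble_playlist",        "args": {"length": 30, "dedupe_artists": True}},
]
```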
Our approach to preference learning must therefore remain flexible across the different stages of the playlist creation process, while also capturing the diversity of valid outcomes. To address this, we introduce a hybrid preference optimization method that enables end-to-end optimization of the entire system.
The Preference Tuning Flywheel
Our method creates a continuous improvement cycle, a preference tuning flywheel with four stages (Figure 1):
Generate: We sample prompts from user logs and produce diverse, executable DSL orchestration plans. By removing trivial plan variations, we ensure the system learns from meaningfully distinct options, simulating how a user might choose between genuinely different playlists.
Score: A reward model estimates user preferences for each candidate plan from Stage 1, with relative calibration on signals that reflect meaningful listening behavior and long-term user satisfaction.
Sample: Preference pairs are constructed with margin constraints and hard negatives, yielding high-signal, data-efficient training examples.
Fine-Tune: Using DPO, we increase the probability of preferred responses while preserving proximity to the base model’s behavior.
This flywheel establishes a virtuous cycle, enabling stable preference alignment from real interaction data while allowing the system to adapt continuously to both user behavior and the evolving catalog.
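As a rough illustration, the loop can be sketched as a single function that wires the four stages together. Every callable below (generate_plans, execute_plan, reward_model, build_pairs, dpo_update) is a hypothetical placeholder for the components described in this post, passed in as a parameter so the sketch stays self-contained.

```python
def flywheel_iteration(policy, prompts, generate_plans, execute_plan,
                       reward_model, build_pairs, dpo_update):
    """One turn of the preference tuning flywheel (illustrative sketch).

    All callables are hypothetical stand-ins for the production components."""
    pairs = []
    for prompt in prompts:
        # Generate: diverse, executable orchestration plans (trivial variants removed).
        plans = generate_plans(policy, prompt)
        playlists = [execute_plan(plan) for plan in plans]

        # Score: the reward model estimates long-term satisfaction per candidate.
        scores = [reward_model(prompt, playlist) for playlist in playlists]

        # Sample: margin-constrained preference pairs, including hard negatives.
        pairs.extend(build_pairs(prompt, playlists, scores))

    # Fine-Tune: DPO update that stays close to the base model's behavior.
    return dpo_update(policy, pairs)
```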

Figure 1: Preference tuning flywheel showing the four-stage iterative process: Generate diverse DSL orchestration plans (the tool calls for playlist synthesis), Score candidates with the reward model, Sample high-confidence preference pairs, and Fine-Tune the agentic system using DPO. This cycle continuously improves preference alignment from user interaction data.
Why reward models matter
Standard DPO learns from preference pairs (in our case, alternative playlists) by observing user choices. However, presenting two playlists alone can encourage superficial preferences. For example, users may pick the one with familiar songs or obvious properties, rather than the one that would yield deeper satisfaction after listening. Our aim is to align the reward signal with genuine listening satisfaction. To achieve this, we train a reward model that predicts long-term satisfaction for a given user, query, and playlist, drawing on multiple signals correlated with engagement and retention (Figure 2).
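As a minimal sketch (the actual feature set and architecture are not described in this post), the reward model can be pictured as a small network that maps query, context, and playlist representations to a satisfaction probability, trained with a binary cross-entropy objective on satisfaction-correlated labels.

```python
import torch
import torch.nn as nn

class SatisfactionRewardModel(nn.Module):
    """Illustrative reward model: maps query, context, and playlist embeddings
    to a satisfaction probability in [0, 1]. Dimensions, architecture, and
    training labels are assumptions, not the production design."""

    def __init__(self, query_dim=256, context_dim=64, playlist_dim=256, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(query_dim + context_dim + playlist_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, query_emb, context_emb, playlist_emb):
        features = torch.cat([query_emb, context_emb, playlist_emb], dim=-1)
        return torch.sigmoid(self.mlp(features)).squeeze(-1)  # satisfaction probability

# Trained with nn.BCELoss() against labels derived from engagement- and
# retention-correlated listening signals (label construction is illustrative).
```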

Figure 2: The reward model ingests prompt, context, and playlist features to output a calibrated satisfaction probability, trained with signals aligned with long-term user value and meaningful listening behavior.
The reward model also streamlines updates by enabling offline data generation and user simulation. In practice, this means we can generate new orchestration plans and simulate preferences over them. For instance, given the query “music for a solo night drive through the city,” one orchestration plan might yield a playlist that perfectly captures the intended vibe, while another might be technically valid but less aligned with the preference. This simulation allows us to model user behavior across many possible playlists, even those the user never actually received. By sampling across different levels of difficulty, we can stress-test the system with both clear-cut and borderline cases, helping it generalize beyond the specific playlists a user happened to encounter.
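One way to picture the simulation step (the exact choice model used in production is not disclosed, so the logistic formulation below is an assumption): given reward-model scores for two candidate playlists, sample which one the simulated user would prefer.

```python
import math
import random

def simulate_choice(score_a: float, score_b: float, temperature: float = 0.1) -> str:
    """Simulate a user's choice between two candidate playlists from their
    reward-model scores, using a Bradley-Terry-style logistic model.
    The formulation and temperature are illustrative assumptions."""
    p_prefers_a = 1.0 / (1.0 + math.exp(-(score_a - score_b) / temperature))
    return "A" if random.random() < p_prefers_a else "B"

# Example: a plan that nails the night-drive vibe (score 0.82) versus a
# technically valid but off-vibe plan (score 0.55) -- "A" wins most of the time.
print(simulate_choice(0.82, 0.55))
```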
Stable, scalable fine-tuning
To update the models only where strong preference signals are present, we construct positive/negative pairs using minimum score margins and hard negatives. Score margins ensure that comparisons are made only when one playlist is clearly better than another, providing unambiguous training signals. Hard negatives, by contrast, are carefully chosen alternatives that come close to, but ultimately fall short of, the preferred option, sharpening the model’s decision boundaries.
To prevent overfitting to only “easy wins,” we group candidate pairs into buckets based on margin size. High-margin buckets provide safe, reliable positives; medium-margin buckets capture subtler distinctions; and low-margin buckets introduce difficult but high-value contrasts. Sampling across these buckets balances confidence with nuance, yielding compact, high-signal data. This approach makes training more data-efficient and keeps the model robust, even when the reward function is only relatively accurate rather than perfect.
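A compact sketch of this pair-construction step is below. The margin threshold, bucket boundaries, and per-bucket sampling counts are illustrative assumptions; the production values are not disclosed.

```python
import random
from itertools import combinations

def build_preference_pairs(candidates, min_margin=0.05,
                           buckets=((0.05, 0.15), (0.15, 0.30), (0.30, 1.01)),
                           per_bucket=2, seed=0):
    """Construct (chosen, rejected) pairs from reward-scored candidates for one prompt.

    candidates: list of (playlist, reward_score) tuples.
    Low-margin buckets supply hard negatives (close calls); high-margin buckets
    supply clear-cut wins. All thresholds here are illustrative."""
    rng = random.Random(seed)
    pairs = []
    for (pl_a, s_a), (pl_b, s_b) in combinations(candidates, 2):
        margin = abs(s_a - s_b)
        if margin < min_margin:  # skip ambiguous comparisons
            continue
        chosen, rejected = (pl_a, pl_b) if s_a > s_b else (pl_b, pl_a)
        pairs.append({"chosen": chosen, "rejected": rejected, "margin": margin})

    # Sample across margin buckets to balance confidence with nuance.
    sampled = []
    for lo, hi in buckets:
        bucket = [p for p in pairs if lo <= p["margin"] < hi]
        sampled.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return sampled
```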
DPO with calibrated rewards nudges the model toward outcomes that users prefer while keeping its overall behavior close to the base policy. Because it uses a simple, stable loss defined over preference pairs, it trains on standard fine-tuning infrastructure. For example, given two plausible playlists for the same user query (say, “music for a solo night drive”), we learn from the pairwise comparison: if playlist “A” better matches the user's preferences than playlist “B”, the “A over B” comparison becomes a training pair. This way, the model learns which choices align more closely with user satisfaction, and the approach gives us robust preference alignment that has been straightforward to implement and maintain in production.
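For reference, the standard DPO objective over such pairs can be written in a few lines. The beta value and the summed-log-probability inputs below reflect common implementation conventions, not a description of Spotify's exact setup.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Inputs are the summed log-probabilities of the chosen ("A") and rejected
    ("B") responses under the policy being tuned and the frozen base (reference)
    model; beta controls how far the policy may drift from the base model."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```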
Online experiments
Our production A/B tests on AI Playlist generation demonstrated the tangible, real-world impact of this hybrid preference optimization approach. We observed statistically significant gains across multiple dimensions of user value: listening time increased by 4%, and users were more likely to save generated playlists. Importantly, these gains were achieved while maintaining strict quality guardrails, ensuring that improvements did not come at the expense of user experience. Beyond engagement, we also saw the system become more effective operationally: erroneous tool calls (when the orchestration layer invokes the wrong search filters or retrieval functions) were reduced by 70%, meaning the system produced cleaner, more reliable results while reducing computational overhead.
Taken together, these outcomes highlight not only the effectiveness of the method in driving user satisfaction but also its potential as a robust, scalable approach for preference alignment in production systems.
Engineering practices that made the difference
Several practices emerged as especially critical to making the system work in practice:
Training on playlists rather than orchestration plans proved critical for two reasons: it reduced the gap between offline experiments and real user behavior (users interact with playlists, not the intermediate DSL), and it decoupled preference optimization from system implementation, which is essential in a constantly evolving agentic ecosystem.
Stricter pair selection also made a bigger difference than expected. By enforcing score margins, we ensured the model only learned from comparisons where one playlist was clearly better than another. Hard negatives (close but not quite right alternatives) forced the model to refine its decision boundaries. Together, these practices raised the quality and informativeness of the training data.
Sampling strategies were essential for avoiding overfitting in DPO. Instead of using a single static threshold, we bucketed pairs by difficulty. This allowed us to capture “hard” preference pairs that would otherwise be overlooked, significantly boosting performance and improving generalization.
Execution efficiency mattered as much as modeling. Treating infrastructure as a first-class citizen, through techniques like tool-pool sizing and caching, delivered gains comparable to model improvements. In other words, training system throughput and latency had a direct impact on user outcomes.
Finally, we consistently found that simple methods, when well-calibrated, outperformed complex ones. A robust reward model paired with straightforward DPO proved more effective and maintainable than specialized objectives that were harder to tune and sustain.
Looking ahead
Our results validate the potential of preference optimization when applied to agentic recommendation systems that actively interpret intent, plan actions, call tools, and learn from feedback. By combining preference learning with robust engineering practices, we can deliver experiences that are both adaptive and satisfying for users. These findings provide a strong foundation for scaling and maintaining production-ready agentic systems that continuously improve, underscoring preference optimization as a key enabler of practical, intent-driven recommendation.
Acknowledgments
This project represents a cross-functional effort building on contributions from research, data science, modeling, agent tooling, evaluation, and platform teams.