Calibrated Recommendations with Contextual Bandits on Spotify Homepage


Streaming platforms typically offer a wide variety of content types to engage diverse user interests. On Spotify, for example, the Home page surfaces music, podcasts, and audiobooks side by side. Designing ranking algorithms that balance these content types in line with each user’s evolving preferences is essential for long-term satisfaction. Done well, such systems can prevent over-narrowing of interests and instead foster discovery and exploration. Achieving this balance, however, is technically complex: historical data is skewed by popularity biases, and user tastes are both dynamic and context-dependent. To meet these challenges, the Spotify Home page must not only recognize individual preferences across content types, but also present the right mix of music, podcasts, and audiobooks for each user at any given moment.

Traditional approaches to calibrating recommendations typically rely on historical consumption aggregated over fixed time windows. For example, if a user’s activity over the past three months consisted of 60% music and 40% podcasts, the system would mirror this ratio in its recommendations. Yet such methods overlook finer behavioral patterns. If, for instance, 80% of this user’s podcast listening occurs between morning and afternoon, a time-agnostic approach would misallocate impressions and fail to capture the true rhythm of their preferences. One might try to refine this by computing distributions for different times of day, yet such adjustments capture only a narrow aspect of user behavior. In reality, the interplay of short-term and long-term interests is highly complex, shaped by a multitude of contextual factors (see Figure 1), with time of day being just one among many.
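To make the misallocation concrete, here is a minimal sketch, assuming a toy listening log of (content type, hour) events, that contrasts the single time-agnostic ratio with per-daypart ratios. The event data and daypart boundaries are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Hypothetical listening log: (content_type, hour_of_day) events.
events = [
    ("music", 21), ("podcast", 9), ("music", 11), ("podcast", 10),
    ("music", 20), ("podcast", 8), ("music", 23), ("music", 19),
    ("podcast", 14), ("music", 18),
]

def distribution(counts):
    total = sum(counts.values())
    return {k: round(v / total, 2) for k, v in counts.items()}

# Time-agnostic: one static ratio, mirrored at every request.
static = distribution(Counter(ct for ct, _ in events))

# Time-aware: a separate ratio per coarse daypart.
by_daypart = defaultdict(Counter)
for ct, hour in events:
    daypart = "morning_afternoon" if 6 <= hour < 17 else "evening"
    by_daypart[daypart][ct] += 1

print("static:", static)                      # 60% music / 40% podcasts
for daypart, counts in by_daypart.items():
    print(daypart, distribution(counts))      # podcasts dominate mornings
```

In this toy log the static ratio is 60/40, yet every podcast stream falls in the morning/afternoon bucket, so mirroring 40% podcasts in the evening would waste impressions.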


Figure 1: Illustration of dynamic content-type user preferences. Each color represents a different user cohort; the y-axis shows podcasts’ share of total consumption on Spotify.

A contextual bandit approach

To tackle this challenge, we propose a novel calibration approach based on contextual bandits. Rather than estimating content-type distributions directly from historical data, we cast the problem as supervised learning with bandit feedback. In this formulation, the system acts as a bandit that must choose the optimal distribution of content types (the action) to maximize user engagement on the Home page (the reward), given the current context. The action policy is learned from logs of user interactions, with exploration supported by an epsilon-greedy strategy; that is, the system mostly exploits the best-known action but occasionally explores alternatives to discover better options.
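As an illustration of the action-selection loop, the sketch below shows epsilon-greedy selection over a discretized grid of candidate content-type distributions. The action grid, the epsilon value, and the `predict_reward` interface are assumptions made for exposition, not the production system:

```python
import random

# Hypothetical action grid: candidate (music, podcast) share targets.
ACTIONS = [(1.0, 0.0), (0.8, 0.2), (0.6, 0.4), (0.4, 0.6), (0.2, 0.8)]

def choose_action(model, context, actions=ACTIONS, epsilon=0.1,
                  rng=random.Random(0)):
    """Epsilon-greedy: usually exploit the distribution with the highest
    predicted engagement, occasionally explore a random alternative.

    `model.predict_reward(context, action)` is an assumed stand-in for
    the learned reward model described in the text.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    return max(actions, key=lambda a: model.predict_reward(context, a))  # exploit
```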

Context is represented through temporal signals (e.g., time of day, day of week), contextual features (e.g., device type), and both user and content embeddings. For each request, the system predicts a content-type distribution and uses it to build the recommendation slate sequentially: starting from an empty set, items are added one by one to optimize an objective that balances relevance scores with a Kullback–Leibler divergence penalty. This discourages large deviations from the predicted distribution and ensures that the final slate respects the intended content-type balance.
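A minimal sketch of this greedy slate construction, under assumed inputs, is shown below. The candidate format, the smoothing constant, and the trade-off weight `lam` are illustrative choices, not values from the paper; the KL penalty compares the predicted target distribution against the content-type distribution of the slate built so far:

```python
import math

def kl(p, q, eps=1e-9):
    """KL(p || q) over content types, smoothed to avoid log(0)."""
    return sum(p[t] * math.log((p[t] + eps) / (q.get(t, 0.0) + eps)) for t in p)

def build_slate(candidates, target, k, lam=0.5):
    """Greedily add items, trading off relevance against a KL penalty.

    candidates: list of (item_id, content_type, relevance) tuples
    target: predicted distribution, e.g. {"music": 0.6, "podcast": 0.4}
    lam: illustrative trade-off weight (an assumption, not a tuned value)
    """
    slate, counts, pool = [], {}, list(candidates)
    for _ in range(min(k, len(pool))):
        def marginal(item):
            _, ctype, rel = item
            new = dict(counts)
            new[ctype] = new.get(ctype, 0) + 1
            n = sum(new.values())
            slate_dist = {t: c / n for t, c in new.items()}
            return (1 - lam) * rel - lam * kl(target, slate_dist)
        best = max(pool, key=marginal)
        pool.remove(best)
        slate.append(best[0])
        counts[best[1]] = counts.get(best[1], 0) + 1
    return slate

# Toy usage: fill four slots from five candidates under a 60/40 target.
cands = [("ep1", "podcast", 0.90), ("tr1", "music", 0.85),
         ("tr2", "music", 0.80), ("ep2", "podcast", 0.70),
         ("tr3", "music", 0.60)]
print(build_slate(cands, {"music": 0.6, "podcast": 0.4}, k=4))
```

Note that because the relevance of items already chosen is constant across candidates at each step, scoring only the marginal item’s relevance yields the same greedy choice as re-scoring the whole slate.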

Empirical evaluation

We began with offline evaluations to benchmark the effectiveness of our approach against two established calibration baselines. The first baseline (SC [1]) estimates the target content-type distribution for each user from historical preferences, computed either over a short-term window (last 7 days) or a long-term window (last 90 days). The second baseline (MB [2]) constructs recommendation lists by sequentially sampling from a fixed multinomial distribution over content types, blending them in proportion to the estimated distribution.
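For comparison, here is a minimal sketch of the MB-style baseline, assuming per-type ranked candidate lists and a fixed blending distribution (both hypothetical placeholders):

```python
import random

def multinomial_blend(ranked_by_type, dist, k, rng=random.Random(0)):
    """MB-style baseline: at each slot, sample a content type from a
    fixed multinomial and take that type's next-best item.

    ranked_by_type: per-type candidate lists, best-first (hypothetical)
    dist: fixed blending distribution, e.g. {"music": 0.6, "podcast": 0.4}
    """
    slate = []
    while len(slate) < k:
        avail = [t for t in dist if ranked_by_type.get(t)]
        if not avail:
            break  # all candidate lists exhausted
        t = rng.choices(avail, weights=[dist[a] for a in avail])[0]
        slate.append(ranked_by_type[t].pop(0))
    return slate
```

Unlike the contextual bandit, the blending distribution here never adapts to context; it stays fixed regardless of when or where the request arrives.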

Table 1 summarizes the results. Our contextual bandit method (CB) consistently outperformed both baselines, achieving higher podcast accuracy as well as superior overall accuracy. These gains highlight the value of dynamically adapting to context, rather than relying on static historical ratios or fixed sampling distributions. The experiments in this blog post focus on music versus podcasts, but the approach can be generalized to any number of content types.

| Model              | Podcast accuracy | Overall accuracy |
|--------------------|------------------|------------------|
| CB vs SC (7 days)  | +35%             | +10.2%           |
| CB vs SC (90 days) | +25.6%           | +7.2%            |
| CB vs MB           | +16.6%           | +2.8%            |

Table 1: Offline evaluation metrics for the proposed approach and two existing baseline methods

Encouraged by the promising offline results, we proceeded to an online A/B test. The control condition used the 90-day SC baseline, as it had outperformed the 7-day variant in offline evaluation. Table 2 reports the outcome of this experiment. Our proposed contextual bandit methodology delivered consistent improvements across multiple engagement metrics.

| Model | Podcast i2s | Overall i2s | Consumption | Activity |
|-------|-------------|-------------|-------------|----------|
| CB    | +36.6%      | +3.93%      | +1.28%      | +1.54%   |

Table 2: A/B test results for the proposed approach

Specifically, we observed higher impression efficiency, measured by the impression-to-stream (i2s) ratio; increased Home page activity, measured by the fraction of users initiating a stream directly from the Home page; and greater overall consumption, measured by total minutes streamed on the platform. Together, these metrics capture not only how effectively recommendations convert impressions into streams, but also how much they encourage exploration and sustained listening.

Given the positive and robust impact across these dimensions, the system was successfully deployed at Spotify in March 2025, where it now powers calibration of content-type distributions on the Home page.

Lessons learned and looking ahead

We developed a contextual bandit approach for calibrating content types within recommendation lists, enabling the system to dynamically adapt to users’ evolving and diverse preferences. This approach is especially well suited to platforms like Spotify, where users engage with multiple content types, such as music, podcasts, and audiobooks, in varied and shifting contexts. Our experiments show that this adaptive calibration method significantly outperforms baseline strategies commonly used in industry. In particular, it improves upon approaches that rely on fixed content-type distributions set by business rules, or those estimated solely from historical consumption. By contrast, our method delivers a more personalized and context-aware user experience.

The current reward formulation emphasizes short-term engagement, that is, whether a user streams content from the Home page given a particular distribution of content types. Looking ahead, we plan to extend this framework in several directions. First, by incorporating multi-objective rewards that balance short-term engagement with long-term user satisfaction. Second, by leveraging sequential signals to improve intent prediction. Finally, by exploring upper confidence bound (UCB)–based exploration, which has often been shown to outperform epsilon-greedy strategies in balancing exploration and exploitation.

For more details about this work, please refer to our paper: “Calibrated Recommendations with Contextual Bandits” by Diego Feijer, Himan Abdollahpouri, Sanket Gupta, Alexander Clare, Yuxiao Wen, Todd Wasson, Maria Dimakopoulou, Zahra Nazari, Kyle Kretschman, and Mounia Lalmas, presented at the RecSys CONSEQUENCES Workshop.

References

[1] Harald Steck. 2018. Calibrated recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 154–162.

[2] Jan Malte Lichtenberg, Giuseppe Di Benedetto, and Matteo Ruffini. 2024. Ranking across different content types: The robust beauty of multinomial blending. In Proceedings of the 18th ACM Conference on Recommender Systems. 823–825.