Choice of Implicit Signal Matters: Accounting for User Aspirations in Podcast Recommendations

Podcasting as a medium is growing exponentially, with hundreds of thousands of shows available in genres from comedy to news reporting to true crime storytelling to health and wellness to education and self-directed learning. With such a great variety of content, it is natural that listeners would approach a podcast catalog with some aspirational goals (such as learning a language or eating healthier), while also desiring simpler pleasures such as entertainment or having a feeling of some connection via listening. Recommender systems help listeners find content to listen to, but how can they account for the different goals listeners might have?

Recommender systems are typically trained on some target engagement signal such as clicks, streams, likes, or a weighted combination. In this work, we dive deeper into this target choice by focusing on podcast recommendations at Spotify. There are now more than one million active podcast shows consisting of 64 million episodes. Podcast shows are distributed by RSS feeds; people subscribe to a show to automatically receive each new episode. But it is also common to “dip in” and just listen to single podcast episodes. Thus podcasts offer two strong engagement signals: show subscriptions and episode plays. Certain types of subscriptions may reflect aspirational goals listeners have; in particular, subscribing to a language-learning podcast, a news podcast, or a health and wellness podcast suggests that the user has aspirations that they hope to achieve. But episode plays do not necessarily follow from show subscriptions. The recommender system mediates plays by filtering what users see first, and how it does so depends greatly on which engagement signal it is trained on.

In this work we take advantage of this distinguishing property of the podcast domain to tackle the problem of optimizing recommendations in the presence of multiple implicit engagement signals. We explore this problem by answering three main research questions:

What is the impact on recommendations and consumption when training to “plays” versus “subscriptions”?
What are the strongest factors that predict how listeners engage with shows?
How could we use calibration to make an informed decision and leverage both signals in recommendations?

Effect of optimization signal on the top-n recommended items and user consumption

Our goal here is to highlight the differences observed when optimizing recommendation algorithms based on different engagement signals. We use a recommendation algorithm based on deep neural networks that has shown promising results in similar recommendation applications. The framework casts the recommendation problem as an extreme multi-class classification task modeled by a multilayer perceptron. This provides flexibility in handling heterogeneous feature sets, and the approach is widely used in recommender systems that operate at large scale.
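To make the "extreme multi-class classification" framing concrete, here is a minimal sketch of a forward pass: user features go through one hidden layer, a softmax assigns a probability to every show in the catalog, and the top-n shows become the recommendations. The layer sizes, weights, show names, and the `recommend` helper are all hypothetical and chosen for illustration; they are not the production model.

```python
import math
from typing import List, Tuple


def matvec(matrix: List[List[float]], vector: List[float]) -> List[float]:
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one logit per show."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recommend(user_features: List[float],
              w_hidden: List[List[float]],
              w_out: List[List[float]],
              shows: List[str],
              n: int) -> List[Tuple[str, float]]:
    """Score every show as one class of a multi-class problem; return the top n."""
    hidden = [max(0.0, h) for h in matvec(w_hidden, user_features)]  # ReLU layer
    probs = softmax(matvec(w_out, hidden))                           # one class per show
    ranked = sorted(zip(shows, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Toy weights (assumed, not learned): 2 input features, 2 hidden units, 3 shows.
shows = ["show_a", "show_b", "show_c"]
w_hidden = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
top = recommend([1.0, 0.0], w_hidden, w_out, shows, n=2)
all_probs = recommend([1.0, 0.0], w_hidden, w_out, shows, n=3)
```

In a real catalog the output layer has one class per show, which is what makes the task "extreme"; the training label for each user determines which engagement signal the model learns to predict.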

We train this model twice:

Subscribe Model: User-show pairs are assigned a positive label if the user has subscribed to a show. Users can subscribe even before listening to any of its episodes.
Play Model: User-show pairs are assigned a positive label if the user streams at least one episode of the show.
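As a rough sketch of how the two label definitions differ, the snippet below builds positive user-show pairs for each model from a single interaction log. The log format `(user_id, show_id, event)` and the function name `build_positive_pairs` are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical log record: (user_id, show_id, event), event is "subscribe" or "play".
Interaction = Tuple[str, str, str]


def build_positive_pairs(log: Iterable[Interaction]) -> Dict[str, Set[Tuple[str, str]]]:
    """Return the positive (user, show) pairs for each training target."""
    positives: Dict[str, Set[Tuple[str, str]]] = {"subscribe": set(), "play": set()}
    for user_id, show_id, event in log:
        if event in positives:
            # A single subscription, or a single streamed episode, yields a positive label.
            positives[event].add((user_id, show_id))
    return positives


log = [
    ("u1", "spanish_101", "subscribe"),  # subscribed without ever playing
    ("u1", "daily_news", "play"),
    ("u1", "daily_news", "play"),        # repeated plays collapse into one pair
    ("u2", "daily_news", "subscribe"),
    ("u2", "daily_news", "play"),
]
labels = build_positive_pairs(log)
```

Here the pair `("u1", "spanish_101")` is positive only for the Subscribe Model, while `("u1", "daily_news")` is positive only for the Play Model, which is exactly the kind of disagreement between signals that the rest of the analysis examines.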

We summarize the recommendations provided by each model by using the category of a show and aggregating top shows that are recommended across users. The distribution of recommended items in each model shows large differences between categories:

attachment_441c10d3ec70b92d066e1aed68308399

attachment_acd10caaf3e3788ab83eb82dbb3c55ad — Category distribution of the top 10 recommendations from the Subscription model (top) compared to the Plays Model (bottom).

Among the main differences, we observed that shows in the “Knowledge” category are over-represented when the model is trained on subscriptions. In contrast, “Politics and Current Events” shows are over-represented in the model trained on plays.
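The category summary described above can be sketched as follows: take each user's top-n recommendations, map each show to its category, and normalize the counts across all users. The data structures and the name `category_distribution` are illustrative assumptions, and the toy shows and categories are made up.

```python
from collections import Counter
from typing import Dict, List


def category_distribution(recs: Dict[str, List[str]],
                          show_category: Dict[str, str],
                          n: int = 10) -> Dict[str, float]:
    """Share of each category among the top-n recommendations of all users."""
    counts: Counter = Counter()
    for ranked_shows in recs.values():
        for show in ranked_shows[:n]:
            counts[show_category[show]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


# Toy example with n=2 instead of the article's top 10.
show_category = {"a": "Knowledge", "b": "Comedy", "c": "Knowledge"}
recs = {"u1": ["a", "b"], "u2": ["a", "c"]}
dist = category_distribution(recs, show_category, n=2)
```

Running the same function on the output of the Subscribe Model and of the Play Model gives two distributions that can be compared category by category.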

It is apparent that the choice of implicit signal has a big impact on the type of shows being recommended to users. But does this difference in category representation affect which show categories users end up listening to? A difference in the recommended list does not guarantee a difference in exposure or consumption: users usually do not scroll through the whole list, and the choice of which items to play is also influenced by factors that are independent of exposure.

We deployed both of these recommenders in production and in an A/B test contrasted the user consumption for each variant. Our results showed that indeed both exposure and consumption reflect similar differences:

attachment_452f49e592062a2690b632c933960825

Studying the disparity: What factors are predictors of each engagement type?

Observing the differences between the output of the two models in the previous section, we explored what may cause such disparity. Note that the only difference between the two models was the engagement signal used to train each model.

We therefore performed a regression analysis to explore the underlying factors predicting each engagement type, plays vs. subscriptions. In addition to plays and subscriptions, we investigated feedback signals derived from various levels of engagement, such as “Played > 2 episodes” and “Played > 7 days”. We explored the effects of two types of factors in predicting each engagement type:

Availability Related: Features corresponding to the format of a show such as release cadence and average episode length.
Content Related: Features that describe the content of each show, such as the category of a show or its theme.

attachment_35fc59754f53abfacc9ca774a7fd2b2b

A surprising finding was that episode length and release cadence do not play a major role in how people engage with shows. However, the content of the show, as reflected in its category and theme, is an important factor. For example, shows with a “learning” theme are more likely to be subscribed to than played, regardless of length and release cadence. Likewise, a show being in the Sports category is a strong predictor of more engagement in the form of a >7 days return to the show. This analysis gave us many insights into how users engage with different categories of shows; in particular, the contrast between users’ aspirational behavior when subscribing and their behavior when streaming was enlightening. Optimizing for streams can bias the recommendations towards certain podcast types, undermine users’ aspirational interests, and put some show categories and creators at a disadvantage.
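The article does not give the regression specification, so the following is only a sketch of the general idea: fit a logistic regression of an engagement outcome on show features and compare coefficient sizes. The data is synthetic and constructed so that the outcome depends on a hypothetical `learning_theme` flag but not on a hypothetical `long_episodes` flag; it is not data from the study, and the plain gradient-descent fit stands in for whatever estimator was actually used.

```python
import math
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (feature vector, binary outcome)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows: List[Row], steps: int = 2000, lr: float = 0.5) -> List[float]:
    """Batch gradient descent on the logistic loss; the last weight is the intercept."""
    weights = [0.0] * (len(rows[0][0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for features, label in rows:
            x = list(features) + [1.0]  # append the constant intercept input
            error = sigmoid(sum(w * v for w, v in zip(weights, x))) - label
            for j, v in enumerate(x):
                grad[j] += error * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Synthetic data: features are (learning_theme, long_episodes).
# Outcome is 1 when the engagement was a subscription, 0 when it was a play.
# Learning shows: 3 of 4 outcomes are subscriptions; other shows: 1 of 4,
# with the same rates for short and long episodes.
rows: List[Row] = []
for learning_theme in (0, 1):
    for long_episodes in (0, 1):
        n_subscribed = 3 if learning_theme else 1
        for i in range(4):
            rows.append(((learning_theme, long_episodes), 1 if i < n_subscribed else 0))

learning_coef, length_coef, intercept = fit_logistic(rows)
```

On this toy data the coefficient for the content feature comes out clearly positive (roughly 2.2) while the coefficient for episode length stays near zero, mirroring the qualitative pattern reported above: content-related features carry the signal, availability-related ones do not.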

This analysis suggested that normalization and sampling approaches based on release cadence or episode length would not help explain or close the observed gap between what each model recommends. Instead, we observed that each engagement signal captures part of a user’s interests and goals in engaging with podcasts, and that to best serve users, we need to leverage both play and subscribe signals in recommendations.

But how should we do this? The question of how to optimize for multiple objectives has long been studied in its mathematical forms. In this work we are not looking for the optimal solution; rather, we explore the question of how to simultaneously optimize for different user goals.

Next, we propose a simple and explainable solution rooted in calibration literature that can take into account both aspects of user behavior.

Optimizing recommenders to account for user goals captured across different engagement signals

We use calibration as an approach to leverage both interaction signals, plays and subscriptions, to address users’ various goals in podcast consumption. We first train the recommender on plays to create a pool of shows that a user finds engaging. We then go through the pool and calibrate the final list to reflect the user’s interests as captured in the categories of the shows they subscribed to. Specifically, we train a recommender using streams and create a pool of the top 1000 recommendations, along with their relevance scores from this model. Going through the pool for each user, we create a recommendation list by optimizing for two objectives:

Maximizing the average score of recommended items
Minimizing the divergence of the category distribution in the recommendation list compared to the user's subscription list.
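One way to combine these two objectives, sketched here under assumptions, is a greedy re-ranker in the style common in the calibrated-recommendation literature: at each step, add the candidate that maximizes a weighted difference between the list's average relevance score and the KL divergence from the user's subscription category distribution. The exact objective used in this work is not reproduced here; the trade-off weight `lam`, the smoothing constant `alpha`, and all names and data below are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def kl_divergence(target: Dict[str, float], rec_categories: List[str],
                  alpha: float = 0.01) -> float:
    """KL(target || list distribution), smoothed so empty categories stay finite."""
    counts = Counter(rec_categories)
    total = len(rec_categories)
    divergence = 0.0
    for category, p in target.items():
        if p == 0.0:
            continue
        q = counts[category] / total if total else 0.0
        q_smoothed = (1 - alpha) * q + alpha * p
        divergence += p * math.log(p / q_smoothed)
    return divergence


def calibrated_rerank(pool: List[Tuple[str, float]], show_category: Dict[str, str],
                      target: Dict[str, float], k: int = 10,
                      lam: float = 0.5) -> List[str]:
    """Greedily pick k shows from a scored pool, trading relevance against calibration."""
    selected: List[Tuple[str, float]] = []
    remaining = list(pool)
    while remaining and len(selected) < k:
        def objective(candidate: Tuple[str, float]) -> float:
            items = selected + [candidate]
            avg_score = sum(score for _, score in items) / len(items)
            divergence = kl_divergence(target, [show_category[s] for s, _ in items])
            return (1 - lam) * avg_score - lam * divergence
        best = max(remaining, key=objective)
        selected.append(best)
        remaining.remove(best)
    return [show for show, _ in selected]


# Pool scored by a play-trained model; the user subscribes equally to two categories.
pool = [("news1", 0.9), ("news2", 0.8), ("news3", 0.7), ("lang1", 0.4), ("lang2", 0.3)]
show_category = {"news1": "News", "news2": "News", "news3": "News",
                 "lang1": "Education", "lang2": "Education"}
target = {"News": 0.5, "Education": 0.5}
by_score_only = calibrated_rerank(pool, show_category, target, k=2, lam=0.0)
calibrated = calibrated_rerank(pool, show_category, target, k=2, lam=0.5)
```

With `lam=0.0` the list is simply the two highest-scored shows, both News. With `lam=0.5` the second slot goes to an Education show, because matching the user's subscription mix outweighs the drop in average score.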

attachment_3112c7e17bf6f22e7203acd62a333f6d

We balance the two objectives by adjusting a coefficient lambda. With the proper level of calibration, we end up with a recommendation list that is more accurate when evaluated against both subscriptions and plays. In offline evaluations, using lambda = 0.5 improved precision@10 for both plays and subscriptions.
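For reference, precision@10 is the fraction of the top 10 recommended shows that the user actually engaged with in held-out data. A minimal sketch, assuming the common convention of dividing by k and using made-up show identifiers, evaluates the same list against plays and against subscriptions separately:

```python
from typing import List, Set


def precision_at_k(recommended: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k recommended shows found in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for show in recommended[:k] if show in relevant)
    return hits / k


# Hypothetical held-out engagement for one user.
recommended = ["news1", "lang1", "news2", "comedy1"]
held_out_plays = {"news1", "news2"}
held_out_subscriptions = {"lang1"}

p_plays = precision_at_k(recommended, held_out_plays, k=2)
p_subs = precision_at_k(recommended, held_out_subscriptions, k=2)
```

Averaging each value over users gives one precision@k number per engagement signal, which is how a single list can be judged against both targets.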

Summary

In this work, we demonstrated how a simple choice of optimization target, plays or subscriptions, can have a big impact on the type of shows recommended to users. Optimizing for streams rather than subscriptions can undermine parts of users’ interests and put some show categories at a disadvantage. Finally, we proposed calibration as a simple and explainable approach to leverage both signals, which resulted in a recommendation list that satisfies both targets.

Check out our paper for many more details: Choice of Implicit Signal Matters: Accounting for User Aspirations in Podcast Recommendations. Zahra Nazari, Praveen Chandar, Ghazal Fazelnia, Catie Edwards, Ben Carterette, Mounia Lalmas. The Web Conference (WWW) 2022.
