Algorithmic Effects on the Diversity of Consumption on Spotify

December 03, 2020 Published by Ashton Anderson, Lucas Maystre, Rishabh Mehrotra, Ian Anderson and Mounia Lalmas

More diverse, less diverse

On Spotify, people are spoiled for choice: there are millions of songs by millions of artists that they can listen to. To help our users sort through all this, we have recommendation algorithms that suggest songs they might like. While we know that these kinds of algorithms work very well in the short term (i.e. they’ve gotten quite good at suggesting songs that people end up listening to), the research community is much less clear on how recommendations interact with the long-term trajectories people take. In this work, we analyze our users through the lens of the diversity of their music consumption—the breadth of unique genres, categories etc. in their listening—and ask the questions: Do recommender systems influence the diversity of content that users listen to? And what is the resulting effect on the user experience?

Measuring users’ content diversity

To investigate this, we first have to define a metric that captures how “diverse” a user’s listening is. By analysing over 850 million playlists, we can determine the similarity of two tracks as the likelihood that they co-occur in the same playlist. We then define the diversity of a user’s listening as the average similarity between their streamed tracks—the more similar the tracks in their listening history, the less diverse their listening patterns are considered for the purposes of this study. We refer to this as the Generalist-Specialist score (GS-score), where low GS-scores correspond to diverse listening and high GS-scores correspond to narrow listening.
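To make the definition concrete, here is a minimal sketch of a GS-score computed on toy data. It is not our production implementation: it assumes that track similarity can be approximated by the Jaccard overlap of the playlists containing each track, and the function names `track_similarity` and `gs_score`, as well as the playlists themselves, are made up for illustration.

```python
from itertools import combinations


def track_similarity(track_a, track_b, playlists):
    """Assumed similarity: Jaccard overlap of the playlists containing each track."""
    with_a = {i for i, p in enumerate(playlists) if track_a in p}
    with_b = {i for i, p in enumerate(playlists) if track_b in p}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


def gs_score(streamed_tracks, playlists):
    """Mean pairwise similarity over a user's distinct streamed tracks.

    High values mean narrow (specialist) listening, low values mean
    diverse (generalist) listening.
    """
    unique = sorted(set(streamed_tracks))
    pairs = list(combinations(unique, 2))
    if not pairs:
        raise ValueError("need at least two distinct tracks")
    return sum(track_similarity(a, b, playlists) for a, b in pairs) / len(pairs)


# Toy playlists (illustrative only).
playlists = [{"a", "b", "c"}, {"a", "b"}, {"d", "e"}, {"c", "d"}]

specialist = ["a", "b", "a"]  # tracks that always appear together
generalist = ["a", "d", "e"]  # tracks that rarely share a playlist

print(gs_score(specialist, playlists))
print(gs_score(generalist, playlists))
```

In this toy example the specialist scores higher than the generalist, matching the convention that high GS-scores correspond to narrow listening.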

Is this notion of user diversity consistent over time? For over 100 million Spotify Premium users, we computed listening diversity scores one year apart, in July 2018 and July 2019, and compared them.

Distribution of diversity score in 2019 versus diversity score in 2018 for a sample of premium users. Listening diversity is quite stable over time.

As the figure above shows, diversity scores are remarkably stable even one year apart, suggesting they capture a meaningful aspect of user listening.

Listening diversity and user outcomes

How does diversity relate to important user outcomes? Are users who listen diversely more or less satisfied than those who listen narrowly? Here we looked at two important business metrics: user conversion and user retention. We performed a correlational analysis: for a given level of diversity, how often do people using the Free version of Spotify convert to a Premium subscription, and how often do Premium users stay on Spotify? Since engagement is such a strong predictor of both these metrics, we control for the number of times a user streamed, so that we only compare users at the same activity level.

Probability of converting from free to premium after one year as a function of July 2018 diversity. We use the generalist-specialist score (GS-score) as the diversity metric: Low GS-score = diverse listening, high GS-score = specialized listening.

The above plot shows how the conversion rate varies with the GS-score, our metric of listening diversity (lower values correspond to higher diversity), relative to the overall average conversion rate for several different activity levels. Among relatively active users, people with more diverse listening habits were 25 percentage points more likely to convert from Free to Premium than those with less diversity in their music consumption.
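The stratified comparison described above can be sketched as follows. This is an illustrative outline rather than the analysis code behind the plot: the record fields (`gs`, `streams`, `converted`), the bin edges, and the toy user records are all assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


def relative_conversion(users, gs_edges, activity_edges):
    """Conversion rate per (activity bin, GS bin), divided by the overall rate.

    Comparing GS bins only within the same activity bin is what "controlling
    for activity" means in this sketch.
    """
    overall = sum(u["converted"] for u in users) / len(users)
    if overall == 0:
        raise ValueError("no conversions in the sample")
    counts = defaultdict(lambda: [0, 0])  # key -> [converted, total]
    for u in users:
        key = (bisect_right(activity_edges, u["streams"]),
               bisect_right(gs_edges, u["gs"]))
        counts[key][0] += int(u["converted"])
        counts[key][1] += 1
    return {key: (conv / total) / overall
            for key, (conv, total) in counts.items()}


# Toy records (illustrative only, not real user data).
users = [
    {"gs": 0.2, "streams": 500, "converted": True},
    {"gs": 0.3, "streams": 600, "converted": True},
    {"gs": 0.8, "streams": 550, "converted": False},
    {"gs": 0.9, "streams": 700, "converted": True},
    {"gs": 0.2, "streams": 20, "converted": False},
    {"gs": 0.3, "streams": 30, "converted": True},
    {"gs": 0.8, "streams": 25, "converted": False},
    {"gs": 0.9, "streams": 10, "converted": False},
]

# One split per axis: activity bin 1 = active, GS bin 0 = diverse.
result = relative_conversion(users, gs_edges=[0.5], activity_edges=[100])
for key in sorted(result):
    print(key, result[key])
```

A value above 1 means that the group converts more often than the average user, and a value below 1 means less often.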

Probability of churning after one year as a function of July 2018 diversity. We use the generalist-specialist score (GS-score) as the diversity metric: Low GS-score = diverse listening, high GS-score = specialized listening.

The above plot shows a similar analysis for user churn: users with diverse listening are between 10 and 20 percentage points less likely to churn than those with less diverse listening, controlling for activity level. So we see that listening diversity is associated with both user conversion and retention.

The effect of programming on consumption diversity

We have just seen that diverse listening is strongly associated with long-term user metrics. How do algorithmic recommendations affect the diversity of what users listen to? We categorized users’ music consumption into user-driven, organic streams (e.g., those coming from user-curated playlists) and algorithm-driven, programmed streams (e.g., those coming from our popular “Discover Weekly” playlists). As part of this study, for every Premium user we separately computed the diversity of both their programmed and organic streams during a 28-day period in July, 2019.

Distribution of organic vs. programmed similarity (lower values correspond to more diverse listening). Organic streaming is more diverse than programmed streaming for a majority of our premium users.

As the above plot shows, most users are more diverse in their organic streaming and less diverse in their programmed streaming. It’s important to note that the metric by which we’re assessing musical diversity (a track’s likelihood of being on the same playlist as another) is explicitly incorporated into some of our algorithms as a way of making relevant recommendations, an approach that has shown success in listener satisfaction.

We saw earlier that users typically don’t change the diversity of their listening — but what happens when they do? When someone increases the diversity of their listening, what play-contexts are they using compared to someone who decreases their diversity?

Log-odds ratios of streaming by diversity-seekers versus diversity-avoiders as a function of play context. When a user increases their diversity, they typically do so by listening to their own collections more.

To answer this question, we conducted an analysis of two kinds of users: those who increased their diversity from 2018 to 2019, and those who decreased their diversity from 2018 to 2019. For these two sets, we analyzed their usage by play context, and measured which group was more likely to stream from which play context. The results are quite clear: when a user increases their diversity, they typically do so by listening to their own collections more.
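For readers unfamiliar with log-odds ratios, the comparison can be sketched like this. The context names, the stream counts, and the optional additive smoothing are assumptions made for illustration; they are not the exact procedure or data behind the figure.

```python
import math


def log_odds_ratio(seeker_counts, avoider_counts, context, smoothing=0.5):
    """Log-odds ratio of streaming from `context`: seekers versus avoiders.

    Positive values mean that diversity-seekers are relatively more likely
    to stream from this context than diversity-avoiders. `smoothing` is
    added to each cell to avoid division by zero (an assumed choice).
    """
    s_in = seeker_counts.get(context, 0)
    a_in = avoider_counts.get(context, 0)
    s_out = sum(seeker_counts.values()) - s_in
    a_out = sum(avoider_counts.values()) - a_in
    seeker_odds = (s_in + smoothing) / (s_out + smoothing)
    avoider_odds = (a_in + smoothing) / (a_out + smoothing)
    return math.log(seeker_odds / avoider_odds)


# Toy stream counts per play context (hypothetical names and numbers).
seekers = {"collection": 60, "radio": 40}
avoiders = {"collection": 40, "radio": 60}

for ctx in ("collection", "radio"):
    print(ctx, round(log_odds_ratio(seekers, avoiders, ctx), 3))
```

In this toy data, the value for `collection` is positive and the value for `radio` is negative, which is the kind of pattern the figure summarizes per play context.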

How do users respond to algorithmic recommendations depending on their consumption diversity?

As mentioned earlier, the fact that recommendation algorithms work well in the short term (e.g. by finding songs that users like) is well-known. But how do people respond to recommendations based on their personal level of listening diversity?

To answer this, we ran a randomized experiment to measure how the short-term benefits of recommendation algorithms vary for those who listen broadly versus those who listen narrowly. We tested three simple recommendation approaches, and compared how “generalists” (broad listeners) and “specialists” (narrow listeners) responded to them.

Relative performance of the different ranking algorithms in our online experiment. Recommending using the “Relevance” and “Learned” models (personalized recommendations) worked much better than simply using popularity (an unpersonalized baseline), and they benefited specialists much more than generalists.

As expected, our personalized recommendation algorithms (“Relevance” and “Learned”) worked far better than the unpersonalized baseline (“Popularity”) for both generalists and specialists. Interestingly, they benefited narrow listeners much more than broad listeners. From this we can infer that current approaches to recommendations perform better for those with less diversity in their listening habits than for those with more diverse tastes.

These findings may not be surprising; algorithmic recommendations (not just those at Spotify, and not just within music, but as a category) try to give suggestions near what you’ve already liked: “you like A, B is similar to A, maybe you’ll like B?”. We also cannot assume that programming caused users to listen to less diverse content than they otherwise would have listened to. People who naturally have less eclectic tastes may simply find programming options more appealing. Separately, we found that listening diversity is strongly associated with user conversion and user retention.

Our findings are correlational, and are consistent with a variety of potential causal explanations (for example, it is possible that listening diversity and retention are caused by a love of music, and that users seek diversity through modes other than algorithmic recommendations). Nevertheless, our findings show some interesting patterns, and we’re planning for further research to get better insights into the causal relationship between listening diversity and engagement.

More can be found in our paper “Algorithmic Effects on the Diversity of Consumption on Spotify” published at The Web Conference 2020.