Cold-Starting Podcast Ads with Multi-Task Learning

One Podcast Ecosystem, Many Objectives
Podcasts are one of Spotify’s fastest-growing content verticals, with over 750 million users discovering shows across listening surfaces, search, home recommendations, ads, and in-app promotions. For creators, early traction matters: a handful of additional streams, clicks, or follows can determine whether a new podcast ever reaches an audience. For Spotify, effective personalization across these touchpoints is essential for both user experience and monetization.
Two major channels connect listeners to podcasts: podcast advertisements, such as in-stream audio or video ads, and podcast promotions, such as display placements highlighting strategically important or relevant shows (for example, Spotify Originals). Although these channels differ in presentation and immediate objectives, they share a common underlying goal: matching the right user with the right podcast at the right moment. A listener who streams a podcast after seeing a promotion is exhibiting the same latent preference as one who streams after hearing an ad.
Historically, we optimized these surfaces with separate, task-specific models. A promotion model might predict the probability of a stream from a homepage shelf, while an ads model predicts streams or clicks from audio and video ads. This siloed setup introduced several persistent problems. New objectives were slow to launch because every product change required building and validating a new model. Cold-start issues posed another challenge, as new ad products and new creators lacked enough data for specialized models. Finally, rich signals from promotions could not easily inform ads, and vice versa, leading to missed opportunities for transfer learning. These challenges motivated our hypothesis: a unified model that jointly learns from both podcast ads and promotions can improve both cold-start and overall performance.
The Core Idea: A Unified Multi-Objective Podcast Model
In this work, we propose a unified multi-task learning (MTL) framework that jointly optimizes podcast ads and promotions within one model. Instead of treating each channel as an independent system, we model them as related prediction tasks that share user, content, and context representations. Concretely, the model predicts multiple outcomes for a user–podcast pair: whether the user streams after a promotion impression, whether the user streams after an ad impression, and whether the user clicks, likes, or follows the podcast. All of these tasks are learned together using a shared representation, with task-specific prediction heads on top.
This design allows the system to transfer knowledge from data-rich tasks to data-scarce ones, bootstrap new objectives quickly, and consolidate previously separate pipelines into a single maintainable system.
From Siloed Models to a Joint Architecture
Before arriving at the unified solution, we explored several intermediate approaches. The original production promotions model used a shared encoder feeding multiple task-specific towers to predict streams, clicks, likes, and follows from promotion impressions. Features included user listening and follow history, podcast and episode identifiers and embeddings, contextual signals such as surface and time, and promotion metadata such as slot and layout. This model was also temporarily reused to score ad impressions for cold-start ad objectives, but it lacked ad-specific features and channel-specific signals.
In parallel, a separate ads-only model was trained on podcast ad impressions across audio, video, and display. It used similar user and content features, augmented with ad metadata such as creative and campaign. While strong on ads, this model did not generalize well to promotions and did not provide a path toward unified multi-objective optimization.
Maintaining both pipelines duplicated engineering effort and prevented systematic knowledge sharing. These limitations motivated a single joint model for both ads and promotions.
Joint Ads–Promotions Modeling
The unified model treats every training example as an impression associated with either the ads source or the promotions source. A shared encoder maps user, podcast, context, and creative features into a dense representation. On top of this representation, task-specific towers produce probabilities for each outcome (streams, clicks, likes, follows).
Architecturally, this follows a simple principle: share as much as possible in the representation, but preserve task-specific capacity at the prediction layer. In practice, this gives the model enough flexibility to capture channel-specific patterns while still benefiting from large-scale shared learning.
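The shared-encoder, multi-tower layout can be sketched in a few lines. The task names, layer sizes, and single-layer encoder below are illustrative assumptions, not the production configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The five outcomes predicted for each user-podcast impression.
TASKS = ["promo_stream", "ad_stream", "click", "like", "follow"]
FEATURE_DIM, HIDDEN_DIM = 32, 16

# Shared encoder over concatenated user/podcast/context/creative features
# (a single dense layer here, purely for brevity).
W_shared = rng.normal(size=(FEATURE_DIM, HIDDEN_DIM))
# One task-specific head per objective, all reading the shared representation.
W_heads = {task: rng.normal(size=(HIDDEN_DIM, 1)) for task in TASKS}

def predict(features: np.ndarray) -> dict:
    """Map a batch of impressions to per-task outcome probabilities."""
    shared = relu(features @ W_shared)  # shared representation
    return {t: sigmoid(shared @ W_heads[t])[:, 0] for t in TASKS}

batch = rng.normal(size=(4, FEATURE_DIM))
probs = predict(batch)
```

Because every head consumes the same `shared` representation, gradients from all five objectives shape the encoder, while each head retains its own parameters for channel-specific patterns.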
Controlling Transfer Between Channels
Joint training introduces the risk of negative transfer, where gradients from one source harm performance on another. We address this with two complementary mechanisms.
First, we apply directional loss masking. Promotion impressions are allowed to update both promotion-related tasks and ad-related tasks, while ad impressions update only ad-related tasks. Intuitively, promotion data contains rich signals about user–podcast affinity that are broadly useful, whereas ad-specific effects should not distort promotion learning.
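A minimal sketch of directional loss masking, assuming a per-example source label and the five illustrative task names used above (the exact masking table is an assumption based on the description):

```python
import numpy as np

# Which heads each impression source may update. Promotion impressions
# supervise every task; ad impressions update only ad-related tasks.
UPDATABLE = {
    "promotion": {"promo_stream", "ad_stream", "click", "like", "follow"},
    "ads": {"ad_stream", "click", "like", "follow"},
}

def masked_bce(preds: dict, labels: dict, sources: list) -> float:
    """Binary cross-entropy summed over tasks, with a per-example mask
    that zeroes out contributions from disallowed (source, task) pairs."""
    total, eps = 0.0, 1e-7
    for task, p in preds.items():
        y = labels[task]
        mask = np.array([task in UPDATABLE[s] for s in sources], dtype=float)
        bce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        total += float((bce * mask).sum() / max(mask.sum(), 1.0))
    return total
```

With this mask, an ad impression contributes zero loss (and hence zero gradient) to the promotion-stream head, while a promotion impression supervises every head.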
Second, we use source-balanced sampling. Each training batch contains roughly equal numbers of ad and promotion impressions, ensuring that neither source dominates optimization simply due to volume.
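Source-balanced sampling can be sketched as follows. The in-memory sampler and exact 50/50 split are illustrative assumptions rather than the production data pipeline, which streams from impression logs:

```python
import random

def balanced_batches(ads, promos, batch_size, seed=0):
    """Yield batches containing equal numbers of ad and promotion
    impressions. Sampling is with replacement, so the smaller source is
    effectively upsampled instead of letting the larger one dominate."""
    rng = random.Random(seed)
    half = batch_size // 2
    n_batches = (len(ads) + len(promos)) // batch_size
    for _ in range(n_batches):
        batch = rng.choices(ads, k=half) + rng.choices(promos, k=half)
        rng.shuffle(batch)  # avoid ordering effects within the batch
        yield batch
```

Even if ad impressions outnumber promotion impressions by an order of magnitude in the logs, every gradient step still sees both sources in equal proportion.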
Together, these techniques stabilize training and ensure that transfer remains largely positive.
The Role of Ancillary Objectives
A key part of the design is the use of ancillary prediction tasks for actions a user can take on Spotify, such as clicks, likes, and follows. These events provide additional, complementary signals that already power Spotify's recommendations engine, and they can substantially improve model performance, particularly for ad interactions. In a shared encoder, the extra supervision helps the model learn richer user and content representations that improve the harder stream-prediction tasks. Importantly, ancillary heads are most effective when combined with true cross-channel training; by themselves they cannot replace joint modeling of ads and promotions.
The offline ablations make this especially clear. In the ads-only setting, adding ancillary heads improves Ads Average Precision (AP) substantially (from +27.0% to +46.5% relative to the promotions baseline), but Promotions AP remains severely degraded (around -65%). Removing the ancillary heads from the Promotions baseline model also led to worse results across both Promotions AP (-7.9%) and Ads AP (-8.8%). In other words, ancillary labels help, but they do not replace cross-channel learning. The best results come from combining ancillary objectives with joint Ads–Promotions training.
Offline Evaluation
We evaluated offline performance using Average Precision (AP), which is more informative than AUC-ROC for these tasks because stream outcomes are highly imbalanced. Training data came from production ads and promotions logs over a multi-month period, with a temporal split (older impressions for training, intermediate for validation, newest for test) to better match real deployment conditions. All models used the same optimizer (Adam) and learning-rate schedule, with hyperparameters tuned on validation AP for stream tasks.
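For reference, Average Precision over a ranked list can be computed directly. This is the standard ranking formulation, not Spotify-specific code:

```python
import numpy as np

def average_precision(labels: np.ndarray, scores: np.ndarray) -> float:
    """Average Precision: the mean of precision values taken at each
    positive example while walking down the score-sorted ranking.
    Unlike AUC-ROC, it is sensitive to how the rare positives (streams)
    are ranked, which matters for highly imbalanced outcomes."""
    order = np.argsort(-scores)            # rank by descending score
    y = labels[order]
    cum_pos = np.cumsum(y)                 # positives seen so far
    precision_at_k = cum_pos / (np.arange(len(y)) + 1)
    return float((precision_at_k * y).sum() / max(y.sum(), 1))
```

A model that places all streams at the top of the ranking scores 1.0, while one that buries them scores close to the positive rate, which is why AP separates these models more sharply than AUC-ROC on rare-stream tasks.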
| Task Setup | Promotions Avg. Precision (relative to promotions baseline) | Ads Avg. Precision (relative to promotions baseline) |
| --- | --- | --- |
| Promotions stream-head only | -7.9% | -8.8% |
| Ads-only model | -65.2% | +27.0% |
| Ads-only model + ancillary heads | -64.8% | +46.5% |
| Unified Promotions + Ads 5-task MTL | +4.5% | +50.2% |
Compared with the promotions-only baseline, the unified 5-task MTL model delivered the strongest overall performance in offline experiments, with +4.5% on Promotions AP and +50.2% on Ads AP. As expected, the ads-only models improved Ads AP but performed poorly on Promotions AP, where adding ancillary heads had only a marginal impact. The joint model is the only one that meaningfully improves both sides at the same time.
Online A/B Test Results
We ran budget-split A/B tests across more than 180 markets, comparing the unified model with the baseline system that used the promotions model for ad cold-start. The unified model delivered statistically significant (p < 0.05) improvements across multiple business metrics simultaneously: an average ~18% increase in impression-to-stream rate, an approximately 20% lower effective cost-per-stream (eCPS, a quantitative metric for measuring an advertiser's return on ad spend, or RoAS), and an overall uplift of roughly 18% in total streams. Results may vary by market, campaign configuration, and podcast characteristics.
The gains were even larger for emerging and lower-listened creators. In this segment, we observed approximately 24% higher impression-to-stream rate, a 22% reduction in effective cost-per-stream, and 27% more total streams. These results indicate that the unified model is particularly effective in cold-start scenarios.
Cold-Start Behavior by Popularity
When podcasts are bucketed by recent listening volume, improvements grow monotonically as popularity decreases. High-stream podcasts see modest gains, while low-stream podcasts see dramatic improvements, with impression-to-stream rate increases reaching up to ~60% and large reductions in cost-per-stream. This pattern provides strong evidence that transfer learning from promotions and ancillary signals is the primary driver of cold-start performance.
Key Takeaways
A single unified multi-task model can outperform a collection of specialized systems for podcast ads and promotions. Directional transfer and balanced training are crucial to making this work in practice. Ancillary engagement signals materially improve stream prediction, and the largest benefits accrue to cold-start creators.
Beyond metric improvements, the unified approach simplifies system architecture, reduces duplicated pipelines, and accelerates the launch of new podcast ad and promotion products.
Where We’re Heading
This unified model provides a strong foundation for podcast targeting at Spotify. Ongoing work explores extending the approach to additional verticals such as music, audiobooks, and video, adding new objectives without redesigning the core architecture, and further improving efficiency and representation learning. For platforms that must optimize heterogeneous discovery and monetization surfaces over the same content, unified multi-task learning offers a scalable and durable solution.
For more information, please refer to our paper:
Cold-Starting Podcast Ads and Promotions with Multi-Task Learning on Spotify. WSDM 2026, Boise, ID, USA.
Acknowledgements: We’d like to thank a number of collaborators who contributed to this work, including Cindy Natarajan, Shubham Bansal, Leo Lien, Gordy Haupt, Barry O’Neill, Zofia Trstanova, David Gustafsson, Raphaëlle Bertrand-Lalo, Per Berglund, Santiago Cassalett, Sunday Folorunso and Joachim Toft.



