Profile-aware LLM-as-a-Judge for Podcasts: A Better Middle Ground Between Offline Metrics and A/B Tests


Evaluating podcast recommendations is notoriously difficult. Offline metrics are quick but biased, while A/B tests provide rigor at the cost of time and resources. To bridge this gap, we propose a profile-aware LLM-as-a-Judge: it summarizes a listener’s tastes and asks an LLM to score candidate episodes or lists against that profile. In a 47-user study, this approach achieved 75% alignment with listener judgments and highlighted meaningful differences between two production-grade models, offering a practical middle ground between offline metrics and A/B testing.

Context

Traditional offline metrics, such as hit rate or recall, are fast to compute but inherently biased by what users have already been shown. Online A/B tests, while rigorous, are slow, costly, and operationally constrained. And when exploring new experiences or features, we need ways to detect quality issues before surfacing them to listeners. We propose a middle path: leveraging LLMs as offline judges to deliver scalable and interpretable evaluations of podcast recommendations.

The core idea 

We take a two-step approach. First, a user’s podcast listening history is distilled into a concise, human-readable profile. Then, an LLM is prompted to judge how well candidate episodes, or even full ranked lists, fit that profile. This “profile-aware LLM-as-a-Judge” enables accurate, scalable, and interpretable offline evaluation of long-form audio recommendations.

Why profiles instead of raw histories?

Rather than feeding raw listening logs into a prompt, the framework synthesizes a profile that highlights topical interests, stylistic preferences, and listening behaviors. This reduces prompt complexity, increases transparency (the profile can be inspected directly), and preserves alignment with human preferences. In our 47-person study, the profile-aware judge matched, or outperformed, a variant that relied on raw histories.

What the profiles capture 

Each profile encodes both content preferences and listening patterns: topics, entities, exploration versus specialization, habits, depth of listening, and preferred formats. These summaries are derived from approximately 90 days of listening data.
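To make this concrete, here is a minimal sketch of how such a profile could be represented in code. The schema and field names below are illustrative assumptions, not the exact structure used in our system.

```python
from dataclasses import dataclass

# Hypothetical schema for a generated listener profile; the fields mirror the
# dimensions described above, but names and types are illustrative only.
@dataclass
class ListenerProfile:
    topics: list[str]            # e.g. ["true crime", "behavioral economics"]
    entities: list[str]          # recurring shows, hosts, or guests
    exploration_style: str       # "explorer" vs. "specialist"
    listening_habits: str        # e.g. "weekday commutes, 30-60 minute sessions"
    depth_of_listening: str      # e.g. "finishes most episodes"
    preferred_formats: list[str] # e.g. ["interview", "narrative series"]
    summary: str                 # the human-readable paragraph shown to the judge
```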

How the pipeline works

Our evaluation framework operates in two stages.

Stage 1: User profiling 

We begin by generating a natural-language profile from a listener’s recent history, roughly 90 days of podcast activity. This history includes titles, descriptions, transcripts, and topic tags, which are distilled into a concise, human-readable summary. Figure 1 illustrates this process, showing how raw listening data is transformed into a coherent profile of interests and habits.
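As a rough sketch, Stage 1 can be thought of as a single summarization prompt over the listener's recent episodes. The prompt wording, the metadata fields, and the call_llm helper below are hypothetical placeholders, not our production implementation.

```python
# Minimal sketch of Stage 1 (user profiling). `call_llm` stands in for
# whichever LLM client is used: it takes a prompt string and returns text.
PROFILE_PROMPT = """You are summarizing a podcast listener's tastes.
Given the episodes below (title, description, transcript snippet, topic tags)
from roughly the last 90 days, write a concise, human-readable profile covering:
topics and entities of interest, exploration vs. specialization, listening
habits, depth of listening, and preferred formats.

Listening history:
{history}

Profile:"""


def build_profile(history_items: list[dict], call_llm) -> str:
    """Distill ~90 days of listening activity into a natural-language profile."""
    history = "\n".join(
        f"- {ep['title']}: {ep['description'][:200]} (tags: {', '.join(ep['tags'])})"
        for ep in history_items
    )
    return call_llm(PROFILE_PROMPT.format(history=history))
```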


Figure 1: Example of Profile Generation starting from raw user history.

Stage 2: Judgment

Once the profile is constructed, we prompt an LLM with two inputs: the profile itself and metadata for candidate episodes (or entire ranked lists). The LLM then produces two kinds of judgments:

  • Pointwise: Is this episode aligned with the listener’s interests? The model answers yes/no and provides a rationale.

  • Pairwise: Given two ranked lists (for example, from Model A and Model B), which one is better aligned with the profile?

This process yields both a verdict and an interpretable explanation. Figure 2 shows the full pipeline, moving from profile → candidate episodes/lists → LLM judgment → rationale and verdict.
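The sketch below illustrates both judgment modes, reusing the same hypothetical call_llm helper from the profiling step; the prompt templates and the answer parsing are simplified assumptions for illustration.

```python
# Minimal sketch of Stage 2 (judgment). Prompt wording and the expected answer
# format are illustrative, not the exact templates used in the paper.
POINTWISE_PROMPT = """User profile:
{profile}

Candidate episode (title, description, topic tags):
{episode}

Is this episode aligned with the listener's interests?
Answer YES or NO, then give a one-sentence rationale."""

PAIRWISE_PROMPT = """User profile:
{profile}

Ranked list A:
{list_a}

Ranked list B:
{list_b}

Which list is better aligned with the profile?
Answer A, B, or TIE, then give a brief rationale."""


def judge_pointwise(profile: str, episode: str, call_llm) -> tuple[bool, str]:
    """Return (is_aligned, full reply with rationale) for one candidate episode."""
    reply = call_llm(POINTWISE_PROMPT.format(profile=profile, episode=episode))
    verdict = reply.strip().split()[0].upper().rstrip(".,:")
    return verdict == "YES", reply.strip()


def judge_pairwise(profile: str, list_a: str, list_b: str, call_llm) -> tuple[str, str]:
    """Return ("A", "B", or "TIE", full reply with rationale) for two ranked lists."""
    reply = call_llm(PAIRWISE_PROMPT.format(profile=profile, list_a=list_a, list_b=list_b))
    verdict = reply.strip().split()[0].upper().rstrip(".,:")
    return verdict, reply.strip()
```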


Figure 2: Profile-aware LLM-as-a-Judge pipeline. Two stages: (1) summarize ~90 days of listening into a readable user profile; (2) prompt an LLM with the profile + episode/list metadata to output a rationale and verdict.

How well does the LLM judge align with user feedback?

We evaluated the profile-aware judge in a controlled study comparing two anonymized recommendation models: Model A, which leaned more heavily on content signals, and Model B, which emphasized collaborative filtering. Forty-seven users provided feedback on their top recommendations from the two models, yielding 277 human annotations.

For pointwise evaluations (judging whether an individual episode matched a listener’s interests), the judge reached 75% accuracy. The main source of error was false positives (17%), where the judge marked an episode as relevant even when it was not. This reflects a known optimistic bias of LLMs. Figure 3 (left) shows the episode-level confusion matrix comparing the LLM’s decisions to human annotations.

For pairwise list evaluations (comparing ranked lists from Model A vs. Model B), the LLM judge aligned strongly with users in preferring Model A. Interestingly, it was also more decisive than the users, producing fewer “tie” judgments. Figure 3 (right) shows the comparison at the model level.


Figure 3: Confusion matrices comparing the profile-aware LLM-as-a-Judge with human annotations. Left: episode-level (pointwise) comparison. Right: model-level (pairwise) comparison.

Finally, we compared against two baselines:

  • Raw-history judge: an LLM prompted directly with unprocessed listening histories.

  • Embedding-similarity judge: a model computing sentence-embedding similarity between profiles and recommended episodes.

The profile-aware judge not only outperformed the embedding-similarity judge by a wide margin but also achieved nearly the same accuracy as the raw-history judge, while relying on a more concise and human-interpretable profile. On ROC-AUC, the profile-aware judge scored 0.6442, close to the raw-history judge (0.6476) and well above the embedding-similarity baseline (0.4871). For model selection agreement, it even surpassed the raw-history variant (0.6596 vs. 0.6170), with both far ahead of the embedding-similarity judge (0.5106). This highlights the value of summarizing user histories into interpretable profiles before prompting the LLM.
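For readers who want to run this kind of comparison on their own data, the snippet below shows how these metrics can be computed from paired human/judge labels with scikit-learn. The label arrays are toy placeholders, not our study data, and the judge scores assume some per-episode relevance score is available.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Toy, made-up labels standing in for paired human/judge annotations.
human_labels = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # 1 = annotator marked the episode relevant
judge_labels = np.array([1, 1, 1, 1, 0, 1, 0, 0])   # judge's yes/no verdicts
judge_scores = np.array([0.9, 0.6, 0.8, 0.7, 0.2, 0.95, 0.3, 0.4])  # judge relevance scores

print("Pointwise accuracy:", accuracy_score(human_labels, judge_labels))
print("Confusion matrix:\n", confusion_matrix(human_labels, judge_labels))
print("ROC-AUC:", roc_auc_score(human_labels, judge_scores))

# Model-selection agreement: how often the judge picks the same winner
# ("A", "B", or "TIE") as the human for each pairwise list comparison.
human_prefs = ["A", "A", "TIE", "B", "A"]
judge_prefs = ["A", "A", "A", "B", "A"]
print("Model selection agreement:",
      np.mean([h == j for h, j in zip(human_prefs, judge_prefs)]))
```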

Richer profiles lead to better judgments

We also examined how the amount of listening history included in a profile affects the LLM judge’s accuracy. When profiles were expanded from just 5 episodes to 20, agreement with human judgments increased by 8 percentage points (from 0.51 to 0.59). Figure 4 (left) illustrates this sensitivity to profile coverage.

User feedback supported this trend. Most participants found their generated profiles to be reasonable summaries of their listening behavior. However, some noted that the profiles could capture more nuance, such as favorite hosts, stylistic tone, or long-term tastes, to better reflect their true preferences (Figure 4, right).


Figure 4: Left: Impact of user profile length on Judge & human alignment. Right: Human agreement on profile quality and interest alignment.

Some final words

The profile-aware LLM-as-a-Judge offers a scalable and interpretable way to evaluate podcast recommendations. In our 47-user study, it aligned closely with human judgments, particularly when comparing models or ranked lists. By summarizing listening histories into transparent profiles, it provides both accuracy and explainability, qualities that traditional offline metrics often lack.

Equally important, this approach bridges the gap between fast-but-biased offline metrics and slow, costly A/B tests. It surfaces quality issues earlier, supports nuanced model comparisons, and reduces reliance on large-scale experiments. In practice, it has already guided A/B test launches and helped steer model development toward higher accuracy.

For more information, please check our paper: “Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge” by Francesco Fabbri, Gustavo Penha, Edoardo D'Amico, Alice Wang, Marco De Nadai, Jackie Doremus, Paul Gigioli, Andreas Damianou, Oskar Stål, and Mounia Lalmas. RecSys 2025, Late Breaking Results.