You Say Search, I Say Recs: A Scalable Agentic Approach to Query Understanding and Exploratory Search at Spotify

Personalized recommendations are at the heart of Spotify’s experience. Features like Home and Autoplay learn user preferences and help hundreds of millions of people to discover their next favorite thing every day. 

What about more intentional discoveries though, expressed in the form of exploratory searches? For instance, what if I’m in the mood for some Italian 80s disco nostalgia? Or for a freshly released dark romance story with a plot twist at the end? Or maybe I’m looking to explore some of the latest albums by an artist I just heard about from my best friend? 

Search systems have traditionally struggled with these types of exploratory searches, focusing on textual and semantic retrieval and on helping users find and play content they already know [3]. Agentic technologies offer a path to rethink search systems. By harnessing the outstanding language understanding abilities of modern LLMs [2] and treating Spotify’s production APIs as tools [5] that can be invoked based on the user’s inferred intent, agentic systems can route queries to the appropriate actions and sub-systems, going beyond retrieval and textual matching. However, deploying such systems at scale poses challenges, particularly in environments like Spotify, where results must be personalized for millions of users across diverse markets. 

In this work, we present a scalable agentic approach to query understanding and exploratory search at Spotify. Using an LLM router, post-training techniques, and a set of tools involving search and recommendation APIs and sub-agents, we successfully deployed and evaluated this system at scale. Our experiments showed substantial offline improvements across several search use cases. 

As we continue to iterate and experiment on the deployment of this technology, we have also observed online gains in targeted use-cases (e.g. +3% success in searching for new releases), and have rolled out this system in selected markets and experiences.

LLM router 

A key challenge in designing agentic workflows lies in balancing the generality and scalability of the orchestration process. While several general orchestrators for agents have been proposed [9], these systems, although flexible, often require multiple sequential calls between the LLM and various tools, resulting in high latency. To address this, alternative approaches have advocated for simplified architectures where an LLM router directs the query to the most suitable tool, avoiding nested LLM invocations and significantly reducing overhead [4]. 

We propose an intermediate architecture—“Parallel Fusion Router” (PFR)—that strikes a balance between the flexibility of full orchestration and the efficiency of a simple router (Figure 1). The system takes as input a user query along with associated features. The query is processed by an LLM router, which selects the most appropriate route based on the inferred intent. A route may involve either a single tool call or multiple parallel tool calls. 

Figure 1. Parallel Fusion Router (PFR): The LLM router picks a route based on the query intent. Each route corresponds to one or multiple tool calls that visually map to different sections of a structured SERP (Search Engine Results Page). As illustration, we show two possible SERP layouts.

In the case of a single call, the tool processes the query and returns results that are directly presented to the user. When multiple calls are appropriate, they are issued in parallel, and their results are mapped to different sections of the SERP (defined earlier). This setup is particularly valuable when the query suggests several plausible user intents. For instance, for a query “latest albums by Lady Gaga,” the router may hypothesize that the user wants to find the most recent album, that they are unsure whether an album has been released and are looking for confirmation or a release date, or that they are actually searching for a recently released single and are using the term “album” as a navigational cue. These different intents are fulfilled by triggering three parallel tool calls, with their outputs organized into separate SERP sections. In this way, the PFR not only improves accuracy for the most likely intent through better query understanding but also generates a fused response that captures alternative, less explicit user goals in a coherent manner. 

PFR supports two operational modes: pre-fusion and post-fusion. In the pre-fusion configuration, multi-call routes are predefined as bundles. This approach is faster and more efficient, as the LLM only needs to generate the parameters for a known route, minimizing the number of output tokens. In contrast, the post-fusion mode allows the LLM to dynamically select and combine tools at runtime, offering greater flexibility but at the cost of increased latency. For our use case, we adopt the pre-fusion approach, as it best meets our scalability requirements while maintaining system responsiveness. 
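To make the pre-fusion mode concrete, here is a minimal sketch of how a predefined route bundle might be dispatched in parallel and its results mapped to SERP sections. The route name, tool functions, and bundle definition are hypothetical stand-ins, not Spotify’s production code.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolFn = Callable[..., Awaitable[list]]

# Placeholder tools standing in for production search and recommendation APIs;
# each one fills a different section of the SERP.
async def fetch_latest_album(**params: Any) -> list:
    return [{"section": "top_result", "params": params}]

async def fetch_release_calendar(**params: Any) -> list:
    return [{"section": "release_info", "params": params}]

async def fetch_recent_singles(**params: Any) -> list:
    return [{"section": "singles", "params": params}]

# Pre-fusion: a multi-call route is a predefined bundle of (SERP section, tool)
# pairs, so the LLM router only emits a route name plus call parameters.
ROUTE_BUNDLES: dict[str, list[tuple[str, ToolFn]]] = {
    "latest_release_lookup": [
        ("top_result", fetch_latest_album),
        ("release_info", fetch_release_calendar),
        ("singles", fetch_recent_singles),
    ],
}

async def run_route(route: str, params: dict[str, Any],
                    user_features: dict[str, Any]) -> dict[str, list]:
    """Issue every tool call in the selected bundle in parallel and map each
    result set to its own SERP section."""
    bundle = ROUTE_BUNDLES[route]
    calls = [tool(**params, user_features=user_features) for _, tool in bundle]
    results = await asyncio.gather(*calls)
    return {section: items for (section, _), items in zip(bundle, results)}

if __name__ == "__main__":
    serp = asyncio.run(run_route(
        "latest_release_lookup",
        {"artist": "Lady Gaga"},
        user_features={"market": "US"},
    ))
    print(serp)
```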

To ensure scalability, the router leverages caching, enabling low latency responses and significantly reducing computational costs. By passing only the query (without user features) to the LLM router, we maximize the cache hit rate. When needed, user features are forwarded to the downstream tools, which handle personalization and contextualization of the results. These tools include traditional machine learning-based retrievers and rankers for search and recommendation, such as models that capture user-item or item-item similarities. They also include sub-agents (for example, technologies like AI DJ) that invoke LLMs to perform additional reasoning on the query or refine the results. This hybrid setup ensures that LLMs are invoked only when necessary; for example, when fulfilling the intent requires external knowledge, complex reasoning, or a final SERP refinement step to enhance aspects like relevance or content diversity. 
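A minimal sketch of this caching strategy, assuming a simple in-process cache and a stubbed router call (call_llm_router is a hypothetical placeholder): the cache key is the normalized query alone, while user features only flow into the downstream tool parameters.

```python
import json
from functools import lru_cache

def call_llm_router(query: str) -> tuple[str, str]:
    # Stub for the small, post-trained LLM router; in production this would be
    # a model call that returns a route name plus serialized call parameters.
    return "personalized_recs", json.dumps({"genre": "indie rock"})

# The cache key is the normalized query alone (no user features), so every
# user issuing the same query reuses a single routing decision.
@lru_cache(maxsize=100_000)
def route_query(normalized_query: str) -> tuple[str, str]:
    return call_llm_router(normalized_query)

def handle_search(query: str, user_features: dict) -> tuple[str, dict]:
    route, params_json = route_query(query.strip().lower())
    params = json.loads(params_json)
    # User features never reach the router; they are forwarded only to the
    # downstream tools, which personalize and contextualize the results.
    params["user_features"] = user_features
    return route, params

print(handle_search("New Indie Rock Releases", {"market": "SE"}))
```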

Beyond selecting the route, the LLM router also determines the optimal parametrization of the tool calls, in line with standard agentic design principles [8]. This offers several advantages, as the LLM inherently performs key query understanding tasks, such as rewriting, expansion, and facet extraction, during the formulation of the downstream service call. For example, given a query “new indie rock releases,” the LLM router might enrich the request by adding temporal, genre, and entity-type filters, producing a structured call such as: {"route": "personalized_recs", "genre": "indie rock", "entity_types": ["track", "album"], "max_weeks_from_release": 4}. This approach highlights a major strength of LLM-based query understanding [1], which contrasts with traditional systems that require multiple dedicated classifiers and re-rankers to achieve similar functionality [6]. 
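As a sketch of what parsing such a structured call could look like on the serving side: the field names mirror the example above, while the dataclass and parsing helper are illustrative assumptions rather than the production schema.

```python
import json
from dataclasses import dataclass, field

@dataclass
class RouterCall:
    """Structured output expected from the LLM router: the chosen route plus
    the query-understanding facets it extracted while forming the call."""
    route: str
    genre: str | None = None
    entity_types: list[str] = field(default_factory=list)
    max_weeks_from_release: int | None = None

def parse_router_output(raw: str) -> RouterCall:
    # The router emits a single JSON object; unknown fields are dropped here
    # rather than failing the request.
    data = json.loads(raw)
    allowed = {k: v for k, v in data.items() if k in RouterCall.__annotations__}
    return RouterCall(**allowed)

raw = ('{"route": "personalized_recs", "genre": "indie rock", '
       '"entity_types": ["track", "album"], "max_weeks_from_release": 4}')
print(parse_router_output(raw))
```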

Finally, each route is linked to specific visual components of the SERP. For instance, a broad recommendation intent calls for a page layout designed to support content exploration, so it involves richer visual cues and contextualization. We show two instances of possible layouts in Figure 1. 
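A simple way to express this linkage is a static mapping from route to layout; the layout names and section lists below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical mapping from route to SERP layout: broad, recommendation-style
# intents render an exploration-oriented layout with richer visual cues and
# contextualization, while narrow navigational intents get a compact layout.
ROUTE_TO_LAYOUT: dict[str, dict] = {
    "personalized_recs": {
        "layout": "exploration_grid",
        "sections": ["recommendations", "because_you_listened"],
    },
    "latest_release_lookup": {
        "layout": "top_result_list",
        "sections": ["top_result", "release_info", "singles"],
    },
}

def layout_for(route: str) -> dict:
    # Fall back to a plain result list for routes without a dedicated layout.
    return ROUTE_TO_LAYOUT.get(route, {"layout": "result_list", "sections": ["results"]})
```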

Post-training 

In addition to caching, a key requirement for system scalability is achieving high routing accuracy while relying on a small, efficient LLM. We address this by applying post-training techniques (Figure 2). The process begins with a training dataset composed of both real and synthetic examples, kept disjoint from the evaluation set to ensure unbiased assessment. 

We then prompt a powerful LLM, serving as a teacher model, to generate the correct routing decision for each example, using a carefully crafted prompt that includes task-specific instructions and manually curated few-shot examples. To encourage diverse outputs, we sample responses from the teacher model with a high temperature setting. These candidate routes are then passed through a Rejection sampling Fine-Tuning (RFT) approach [10], in which a reward model filters out low-quality samples. For this purpose, we leverage an LLM-as-a-judge evaluator to assess each candidate. Sampling continues until we obtain a response that receives a PASS score, which is then used as a fine-tuning data point for the smaller LLM router. 
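The rejection sampling loop can be summarized in a few lines. In the sketch below, the teacher and judge calls are stubs standing in for the actual LLM calls, and the route vocabulary is purely illustrative.

```python
import json
import random

def sample_teacher_route(example: dict, temperature: float = 1.0) -> str:
    # Stub for the large teacher LLM, prompted with task instructions and
    # curated few-shot examples, and sampled at high temperature.
    candidates = ['{"route": "personalized_recs", "genre": "indie rock"}',
                  '{"route": "entity_lookup"}']
    return random.choice(candidates)

def judge_is_pass(example: dict, candidate: str) -> bool:
    # Stub for the LLM-as-a-judge reward model; here it simply checks that the
    # candidate parses and names a known route.
    try:
        route = json.loads(candidate).get("route")
    except json.JSONDecodeError:
        return False
    return route in {"personalized_recs", "entity_lookup"}

def rejection_sample(example: dict, max_tries: int = 8) -> str | None:
    """Sample from the teacher until the judge returns PASS; the accepted
    response becomes one fine-tuning data point for the small router."""
    for _ in range(max_tries):
        candidate = sample_teacher_route(example, temperature=1.0)
        if judge_is_pass(example, candidate):
            return candidate
    return None  # drop examples for which no candidate ever passes

training_examples = [{"query": "new indie rock releases"}]
finetune_data = []
for ex in training_examples:
    accepted = rejection_sample(ex)
    if accepted is not None:
        finetune_data.append({"prompt": ex["query"], "completion": accepted})
print(finetune_data)
```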

This post-training strategy enables the router to match the performance of much larger models while being significantly more efficient. Our experiments show a 60% reduction in latency and a 99% reduction in cost, along with an improvement in quality (as measured by LLM-as-a-judge PASS rates) compared to the teacher LLM itself (Table 1). Additionally, post-training leads to faster inference and lower cost even compared to the same model used in a prompt-based setup, due to reduced input context. 

Metric    LLM router (prompting)    LLM router (post-training)
Quality   -5%                       +3%
Latency   -53%                      -60%
Cost      -93%                      -99%

Table 1: Offline relative improvements over Teacher LLM introduced by post-training.

Figure 2: Post-training Approach for the LLM router. Routes are sampled from a teacher LLM and filtered using rejection sampling by an LLM-as-a-judge model, generating synthetic training data. A smaller model is trained on this data, achieving on-par accuracy with the teacher model, while being much more efficient.

Evaluation 

To evaluate the effectiveness of the system, we adopt an LLM-as-a-judge approach [7]. This involves using a powerful language model with access to real-time world knowledge, prompted to assess the quality of results across several dimensions, including relevance, diversity, and freshness. The input to the judge includes the query, the user profile, and the items returned in the SERP, each annotated with metadata such as title, artist, genre, and release date. 

The evaluation prompt is constructed using a set of clear instructions and manually curated few-shot examples. The judge can return either absolute scores (e.g., PASS, NO PASS) or pairwise preferences (e.g., X is better than Y). Our test set consists of a blend of real and synthetic query-user pairs, designed to cover a wide range of use cases and routing strategies. 
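For illustration, here is a sketch of how such a judge prompt might be assembled. The instruction text, field names, and example item are assumptions made for the sketch (and the curated few-shot examples are omitted); this is not the production prompt.

```python
JUDGE_INSTRUCTIONS = (
    "You are evaluating a search results page for a music and podcast app. "
    "Given the query, the user profile, and the returned items, grade the page "
    "on relevance, diversity, and freshness. Answer PASS or NO PASS with a "
    "short justification."
)

def build_judge_prompt(query: str, user_profile: dict, serp_items: list[dict]) -> str:
    """Assemble the evaluation prompt for the judge LLM; item metadata fields
    mirror those mentioned in the text (title, artist, genre, release date)."""
    item_lines = [
        f"- {item['title']} | {item.get('artist', 'n/a')} | "
        f"{item.get('genre', 'n/a')} | released {item.get('release_date', 'n/a')}"
        for item in serp_items
    ]
    return "\n".join([
        JUDGE_INSTRUCTIONS,
        f"Query: {query}",
        f"User profile: {user_profile}",
        "Results:",
        *item_lines,
    ])

print(build_judge_prompt(
    "new albums by lady gaga",
    {"top_genres": ["pop", "dance"]},
    [{"title": "Example Album", "artist": "Lady Gaga", "genre": "pop",
      "release_date": "2025-01-01"}],
))
```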

We use LLM-as-a-judge primarily for offline evaluation. For online validation, we rely on user interaction signals such as clicks and streams. To ensure practical viability, we monitor operational guardrails like latency and cost throughout our evaluations. 

Use-case                    Improvement
Finding similar artists     +115%
New music releases search   +91%
Broad music searches        +25%
Broad podcast search        +15%

Table 2: Offline relative improvements over regular search across different use-cases evaluated by LLM-as-a-judge. 

Results 

We deployed and tested this approach in a major English-speaking country with millions of active users, showing that it can handle production traffic while maintaining standard latency levels for production systems (∼450 ms p75). This confirms both the scalability and cost-feasibility of the approach. From a qualitative perspective, our LLM-as-a-judge evaluations show substantial improvements in search across multiple use-cases (Table 2), such as finding artists similar to a given one (“find artists similar to X”), broad podcast searches (“english lessons for spanish speakers podcast”), searching for new music releases (“new albums by lady gaga”), and broad music searches (“morning motivation with my favorite upbeat tracks”). 

As we continue to iterate on and expand this technology, we have also observed online gains in specific use-cases (+3% success in searching for new music releases). Based on these findings, we have rolled out the system on several platforms that use the Spotify search experience, as well as in the main app, and we keep working to expand its scope to more use-cases and markets.

For more information, please refer to our paper:

You Say Search, I Say Recs: A Scalable Agentic Approach to Query Understanding and Exploratory Search at Spotify
Enrico Palumbo, Marcus Isaksson, Alexandre Tamborrino, Maria Movin, Catalin Dincu, Ali Vardasbi, Lev Nikeshkin, Oksana Gorobets, Anders Nyman, Poppy Newdick, Hugues Bouchard, Paul Bennett, Mounia Lalmas, Dani Doro, Christine Doig Cardet, Ziad Sultan
RecSys 2025 (Industry)

References 

[1] Avishek Anand, Abhijit Anand, Vinay Setty, et al. Query understanding in the age of large language models. arXiv:2306.16004, 2023.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS 2020.
[3] Ang Li, Jennifer Thom, Praveen Chandar, Christine Hosey, Brian St. Thomas, and Jean Garcia-Gathright. Search Mindsets: Understanding Focused and Non-Focused Information Seeking in Music Search. ACM Web Conference, 2019.
[4] Josef Pichlmeier, Philipp Ross, and Andre Luckow. Performance characterization of expert router for scalable LLM inference. IEEE BigData, 2024.
[5] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. NeurIPS 2023.
[6] Chengwei Su, Rahul Gupta, Shankar Ananthakrishnan, and Spyros Matsoukas. A re-ranker scheme for integrating large scale NLU models. IEEE Spoken Language Technology Workshop, 2018.
[7] Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. Large language models can accurately predict searcher preferences. SIGIR 2024.
[8] Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. GPT4Tools: Teaching large language model to use tools via self-instruction. NeurIPS 2023.
[9] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. ICLR 2023.
[10] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv:2308.01825, 2023.