LLMs have been explored for their use in IR as end-to-end rankers, rerankers, and assessors. Recently, the prompt-and-predict paradigm for reranking, in combination with highly performant LLMs, has drawn the attention of researchers. Instead of training or fine-tuning a reranker, LLMs are prompted in a zero-shot manner to produce relevance scores, pairwise preferences, or reranked lists. Existing research, though, has been confined to unstructured text corpora, leaving a gap in our understanding: to what extent do the findings on zero-shot LLM rerankers established on plain-text corpora hold for datasets consisting predominantly of precomputed ranking features, as is common in industrial settings? We explore this question in an empirical study on one public learning-to-rank dataset (MSLR-WEB10K) and two datasets collected from an audio streaming platform's search logs. Our results paint a nuanced picture. On average, a significant performance gap remains: prompting the high-capacity LLM GPT-4 yields up to 16% lower NDCG@10 than the traditional supervised learning-to-rank (LTR) approach LambdaMART on the public MSLR-WEB10K dataset. However, on a subset of hard queries, i.e., queries where the LTR approach ranks a non-relevant document at the top, zero-shot LLM reranking outperforms the LTR baseline. We confirm the same trends on two proprietary audio search datasets. We also provide insights into prompt design choices and their impact on LLM reranking, and show that LLMs remain brittle, with the same strategy sometimes helping and sometimes hurting depending on model size and dataset.
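
To make the zero-shot prompt-and-predict setup concrete, the sketch below illustrates pointwise reranking over precomputed ranking features, assuming candidates are described only by named feature values and that the LLM is reached through a generic `call_llm` function standing in for any chat/completion API. The prompt wording, feature names, and grading scale are illustrative assumptions, not the exact prompts or code used in this work.

```python
# Minimal sketch: zero-shot pointwise LLM reranking over precomputed
# ranking features (illustrative; not the paper's exact prompt or pipeline).
from typing import Callable

POINTWISE_PROMPT = (
    "You are a search relevance assessor.\n"
    "Given a query and a document described only by precomputed ranking features, "
    "output a single integer relevance grade from 0 (not relevant) to 4 (perfectly relevant).\n\n"
    "Query: {query}\n"
    "Document features:\n{features}\n\n"
    "Relevance grade:"
)

def format_features(feature_vector: dict[str, float]) -> str:
    # Render each named feature (e.g. BM25 score, click count) on its own line.
    return "\n".join(f"- {name}: {value}" for name, value in feature_vector.items())

def rerank_pointwise(
    query: str,
    candidates: list[dict[str, float]],
    call_llm: Callable[[str], str],  # stand-in for any LLM chat/completion call
) -> list[int]:
    """Return candidate indices ordered by LLM-predicted relevance (descending)."""
    scores = []
    for features in candidates:
        prompt = POINTWISE_PROMPT.format(
            query=query, features=format_features(features)
        )
        reply = call_llm(prompt)
        # Parse the first digit in the reply; fall back to 0 if parsing fails.
        digits = [c for c in reply if c.isdigit()]
        scores.append(int(digits[0]) if digits else 0)
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```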