Guiding Retrieval using LLM-based Listwise Rankers
Mandeep Rathee, Sean MacAvaney, Avishek Anand
TL;DR
This work tackles the bounded recall problem in cascaded retrieval when using listwise LLM rerankers by proposing SlideGar, a sliding-window, graph-augmented adaptive retrieval method. SlideGar alternates between the initial retrieved pool and a corpus-graph frontier, using the LLM to rank a window and employing reciprocal rank as a pseudo-score to guide subsequent expansion, thereby mitigating the exclusion of relevant documents not initially retrieved. Across MSMARCO and MSMARCO-passage-v2, with diverse retrievers and rankers, SlideGar yields up to 13.23% improvements in nDCG@10 and up to 28.02% in Recall@c, while incurring only about 0.02% additional latency relative to standard LLM reranking. This approach enables broader adoption of LLM-based reranking in settings with limited initial results or high first-stage costs, and the authors release their code for public use.
Abstract
Large Language Models (LLMs) have shown strong promise as rerankers, especially in "listwise" settings where an LLM is prompted to rerank several search results at once. However, this "cascading" retrieve-and-rerank approach is limited by the bounded recall problem: relevant documents not retrieved initially are permanently excluded from the final ranking. Adaptive retrieval techniques address this problem, but do not work with listwise rerankers because they assume a document's score is computed independently from other documents. In this paper, we propose an adaptation of an existing adaptive retrieval method that supports the listwise setting and helps guide the retrieval process itself (thereby overcoming the bounded recall problem for LLM rerankers). Specifically, our proposed algorithm merges results from both the initial ranking and feedback documents provided by the most relevant documents seen up to that point. Through extensive experiments across diverse LLM rerankers, first-stage retrievers, and feedback sources, we demonstrate that our method can improve nDCG@10 by up to 13.23% and recall by up to 28.02%, all while keeping the total number of LLM inferences constant and overheads due to the adaptive process minimal. The work opens the door to leveraging LLM-based search in settings where the initial pool of results is limited, e.g., by legacy systems, or by the cost of deploying a semantic first-stage retriever.
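
To make the loop described above concrete, the following is a minimal Python sketch of a sliding-window, graph-augmented adaptive reranking loop. It is an illustration under stated assumptions, not the authors' released implementation. The names `slidegar_sketch`, `rank_window`, `graph`, `window`, `step`, and `budget` are hypothetical. The corpus graph is assumed to be a plain dictionary of neighbour lists, and the listwise LLM call is abstracted as a function that returns the given documents ordered best-first. The way dropped documents are ordered in the final list is likewise a simplification chosen for this sketch.

```python
from typing import Callable, Dict, List

# Hypothetical ranker interface: takes a query and a list of document ids and
# returns the same ids ordered best-first (a stand-in for one listwise LLM call).
Ranker = Callable[[str, List[str]], List[str]]


def slidegar_sketch(query: str,
                    initial: List[str],
                    graph: Dict[str, List[str]],
                    rank_window: Ranker,
                    window: int = 4,
                    step: int = 2,
                    budget: int = 8) -> List[str]:
    """Rerank up to `budget` documents, alternating between the initial pool
    and a graph frontier scored by reciprocal rank (illustrative sketch)."""
    if not 0 < step <= window:
        raise ValueError("step must satisfy 0 < step <= window")

    pool = list(initial)                  # first-stage results, best first
    current, pool = pool[:window], pool[window:]
    seen = set(current)                   # documents already shown to the ranker
    frontier: Dict[str, float] = {}       # unseen neighbour -> pseudo-score
    dropped: List[str] = []               # documents that slid out of the window
    prefer_frontier = True                # alternate sources between batches

    while True:
        ranked = rank_window(query, current)

        # A listwise ranker yields an ordering, not scores, so the reciprocal
        # rank of a document serves as the pseudo-score of its neighbours.
        for rank, doc in enumerate(ranked, start=1):
            for neighbour in graph.get(doc, []):
                if neighbour not in seen:
                    frontier[neighbour] = max(frontier.get(neighbour, 0.0),
                                              1.0 / rank)

        room = min(step, budget - len(seen))
        if room <= 0:
            return ranked + dropped

        from_frontier = sorted(frontier, key=frontier.get, reverse=True)[:room]
        from_pool = [d for d in pool if d not in seen][:room]
        # Use the preferred source, falling back to the other if it is empty.
        batch = (from_frontier or from_pool) if prefer_frontier \
            else (from_pool or from_frontier)
        if not batch:
            return ranked + dropped
        prefer_frontier = not prefer_frontier

        for doc in batch:
            frontier.pop(doc, None)
        seen.update(batch)

        # Slide the window: keep the top documents, set the rest aside
        # (most recently dropped first; a simplification of this sketch).
        keep = window - len(batch)
        dropped = ranked[keep:] + dropped
        current = ranked[:keep] + batch
```

With a full window and full batches, the loop calls the ranker 1 + (budget - window) / step times. That is the same count a plain sliding-window pass over `budget` documents would need, so the adaptive step adds bookkeeping but no extra ranker calls in this sketch.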
