AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking
Soyoung Yoon, Gyuwan Kim, Gyu-Hwung Cho, Seung-won Hwang
TL;DR
AcuRank tackles the high cost of LLM-based listwise reranking under context constraints by introducing uncertainty-aware adaptive computation, guided by a Bayesian TrueSkill model that maintains probabilistic estimates of document relevance. The method iteratively refines uncertain documents: it initializes estimates with first-stage scores, estimates top-k probabilities via a thresholded latent-score mechanism, and selectively reranks only ambiguous candidates until confidence stabilizes. Empirical results on TREC-DL and BEIR show that AcuRank achieves superior accuracy-efficiency trade-offs across diverse retrievers and rerankers, with scalable compute and robust generalization to out-of-domain settings. The framework provides a flexible anytime approach that lets practitioners trade off accuracy against compute, and it leaves room for future work on richer uncertainty signals and reasoning-aware retrieval.
Abstract
Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Because of limited context size and the high inference cost of long contexts, reranking is typically performed over small subsets of a fixed size, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, leading to inefficiencies. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Using a Bayesian TrueSkill model, we iteratively refine relevance estimates until reaching sufficient confidence levels. Our explicit modeling of ranking uncertainty enables principled control over reranking behavior and avoids unnecessary updates to confident predictions. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models.
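The uncertainty-driven loop summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each document carries a Gaussian belief (mean, standard deviation) over a latent relevance score, and it places the top-k threshold at the midpoint between the k-th and (k+1)-th means, which is an assumption made for illustration only. The names `select_uncertain`, `adaptive_rerank`, `rerank_fn`, `update_fn`, `eps`, and `window` are hypothetical; in particular, `update_fn` stands in for a TrueSkill-style belief update, which is not reproduced here.

```python
import math


def top_k_probability(mu, sigma, threshold):
    """P(latent score > threshold) under a Gaussian belief N(mu, sigma^2).

    Assumes sigma > 0.
    """
    z = (mu - threshold) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def select_uncertain(beliefs, k, eps=0.05):
    """Return ids of documents whose top-k membership is still ambiguous.

    beliefs: dict mapping doc_id -> (mu, sigma).
    Assumption: the threshold is the midpoint between the k-th and
    (k+1)-th largest means; the paper's exact rule may differ.
    """
    ranked = sorted(beliefs, key=lambda d: beliefs[d][0], reverse=True)
    if len(ranked) <= k:
        return []
    threshold = 0.5 * (beliefs[ranked[k - 1]][0] + beliefs[ranked[k]][0])
    uncertain = []
    for doc_id in ranked:
        mu, sigma = beliefs[doc_id]
        p = top_k_probability(mu, sigma, threshold)
        # Confidently in (p near 1) or out (p near 0) documents are skipped.
        if eps < p < 1.0 - eps:
            uncertain.append(doc_id)
    return uncertain


def adaptive_rerank(beliefs, k, rerank_fn, update_fn,
                    window=20, max_rounds=10, eps=0.05):
    """Rerank only ambiguous documents until none remain or the budget ends.

    rerank_fn(batch) -> list of doc ids, best first (hypothetical LLM call).
    update_fn(beliefs, ordering) -> new beliefs (hypothetical stand-in for
    a TrueSkill-style update).
    """
    for _ in range(max_rounds):
        uncertain = select_uncertain(beliefs, k, eps)
        if not uncertain:
            break  # every document is confidently inside or outside the top k
        for start in range(0, len(uncertain), window):
            batch = uncertain[start:start + window]
            beliefs = update_fn(beliefs, rerank_fn(batch))
    return sorted(beliefs, key=lambda d: beliefs[d][0], reverse=True)
```

Because the loop can stop after any round and still return a full ranking from the current means, `max_rounds` acts as the compute budget in this sketch, which mirrors the anytime behavior described in the TL;DR.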
