BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs
Sheshansh Agrawal, Thien Hang Nguyen, Douwe Kiela
TL;DR
BlitzRank addresses the problem of efficiently identifying the top-$m$ of $n$ items using expensive $k$-wise comparisons. It introduces a tournament-graph framework in which each $k$-wise query reveals a complete tournament on $k$ items and transitive closure amplifies that information to certify the top-$m$ items; non-transitive judgments are handled via tiering based on strongly connected components (SCCs). The paper proves correctness and termination, derives bounds for the top-1 case, and conjectures bounds for general $m$. Across 14 benchmarks and 5 LLMs it demonstrates a Pareto-dominant accuracy-efficiency trade-off, with substantial token savings compared to baselines. In practice, this yields robust, predictable, and scalable zero-shot ranking for retrieval-augmented generation and related ranking tasks, even in the presence of cycles and across varying model capacities.
Abstract
Large language models have emerged as powerful zero-shot rerankers for retrieval-augmented generation, offering strong generalization without task-specific training. However, existing LLM reranking methods either rely on heuristics that fail to fully exploit the information revealed by each ranking decision or are inefficient when they do. We introduce a tournament graph framework that provides a principled foundation for $k$-wise reranking. Our key observation is that each $k$-document comparison reveals a complete tournament of $\binom{k}{2}$ pairwise preferences. These tournaments are aggregated into a global preference graph, whose transitive closure yields many additional orderings without further model invocations. We formalize when a candidate's rank is certifiably determined and design a query schedule that greedily maximizes information gain towards identifying the top-$m$ items. Our framework also gracefully handles non-transitive preferences (cycles induced by LLM judgments) by collapsing them into equivalence classes that yield principled tiered rankings. Empirically, across 14 benchmarks and 5 LLMs, our method achieves Pareto dominance over existing methods: it matches or exceeds their accuracy while requiring 25-40% fewer tokens than comparable approaches, and 7$\times$ fewer than pairwise methods at near-identical quality.
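To make the mechanics described above concrete, the following is a minimal Python sketch rather than the paper's implementation. It shows how one $k$-wise result expands into $\binom{k}{2}$ pairwise edges, how transitive closure adds orderings without further queries, and how cycles collapse into tiers. The function names (`add_kwise_result`, `transitive_closure`, `tiers`) and the toy items are hypothetical, and the greedy query schedule is not modeled here.

```python
from itertools import combinations


def add_kwise_result(edges, ranked):
    """Record one k-wise comparison; `ranked` lists items best-first.

    Adds all C(k, 2) implied pairwise edges as (winner, loser) tuples.
    """
    for i, j in combinations(range(len(ranked)), 2):
        edges.add((ranked[i], ranked[j]))


def transitive_closure(nodes, edges):
    """Return reach[u]: every item u beats directly or by transitivity."""
    reach = {u: set() for u in nodes}
    for winner, loser in edges:
        reach[winner].add(loser)
    # Warshall-style closure: allow each node in turn as an intermediate.
    for mid in nodes:
        for u in nodes:
            if mid in reach[u]:
                reach[u] |= reach[mid]
    return reach


def tiers(nodes, reach):
    """Collapse mutually reachable items (cycles) into tiers, best first.

    Illustrative ordering: a tier that beats more outside items ranks higher.
    """
    seen, classes = set(), []
    for u in nodes:
        if u in seen:
            continue
        cls = {u} | {v for v in reach[u] if u in reach[v]}
        seen |= cls
        classes.append(cls)
    classes.sort(key=lambda c: -len(reach[next(iter(c))] - c))
    return classes


# Two 3-wise queries over five items: 6 direct edges in total.
nodes = ["a", "b", "c", "d", "e"]
edges = set()
add_kwise_result(edges, ["a", "b", "c"])
add_kwise_result(edges, ["c", "d", "e"])
reach = transitive_closure(nodes, edges)
known_pairs = sum(len(beaten) for beaten in reach.values())
# "a" is known to beat the other n - 1 items, so it is certainly top-1.
top1_certified = len(reach["a"]) == len(nodes) - 1

# Three queries whose results contain the cycle x > y > z > x.
cyc_nodes = ["x", "y", "z", "w"]
cyc_edges = set()
add_kwise_result(cyc_edges, ["x", "y", "w"])
add_kwise_result(cyc_edges, ["y", "z", "w"])
add_kwise_result(cyc_edges, ["z", "x", "w"])
cyc_tiers = tiers(cyc_nodes, transitive_closure(cyc_nodes, cyc_edges))
```

In the first toy example the two queries share item `c`, so the closure links the winners of the first query to the losers of the second: 6 direct edges become all 10 pairwise orderings. In the second, `x`, `y`, and `z` form a cycle and land in one tier above `w`. The cubic-time closure is chosen for brevity only.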
