BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs

Sheshansh Agrawal, Thien Hang Nguyen, Douwe Kiela

TL;DR

BlitzRank addresses the problem of efficiently identifying the top-$m$ items from $n$ using expensive $k$-wise comparisons. It introduces a tournament-graph framework where each $k$-wise query reveals a complete tournament on $k$ items, and transitive closure amplifies information to certify top-$m$ items; non-transitive judgments are handled via SCC-based tiering. The paper proves correctness and termination, derives bounds for top-1 and conjectured bounds for general $m$, and demonstrates Pareto-dominant accuracy-efficiency across 14 benchmarks and 5 LLMs, with substantial token savings compared to baselines. Practically, this yields robust, predictable, and scalable zero-shot ranking for retrieval-augmented generation and related ranking tasks, even under cycles and varying model capacities.

Abstract

Large language models have emerged as powerful zero-shot rerankers for retrieval-augmented generation, offering strong generalization without task-specific training. However, existing LLM reranking methods either rely on heuristics that fail to fully exploit the information revealed by each ranking decision or are inefficient when they do. We introduce a tournament graph framework that provides a principled foundation for $k$-wise reranking. Our key observation is that each $k$-document comparison reveals a complete tournament of $\binom{k}{2}$ pairwise preferences. These tournaments are aggregated into a global preference graph, whose transitive closure yields many additional orderings without further model invocations. We formalize when a candidate's rank is certifiably determined and design a query schedule that greedily maximizes information gain towards identifying the top-$m$ items. Our framework also gracefully handles non-transitive preferences - cycles induced by LLM judgments - by collapsing them into equivalence classes that yield principled tiered rankings. Empirically, across 14 benchmarks and 5 LLMs, our method achieves Pareto dominance over existing methods: matching or exceeding accuracy while requiring 25-40% fewer tokens than comparable approaches, and 7$\times$ fewer than pairwise methods at near-identical quality.
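The two mechanisms named in the abstract, a $k$-wise answer expanding into $\binom{k}{2}$ pairwise edges and transitive closure yielding further orderings at no extra cost, can be illustrated with a small sketch. This is not the paper's implementation: the function names `tournament_edges` and `transitive_closure`, the prior edges, and the oracle's answer are all hypothetical values chosen for illustration.

```python
from itertools import combinations
from math import comb

def tournament_edges(ranking):
    """Edges (winner, loser) implied by one k-wise ranking, listed best first."""
    return set(combinations(ranking, 2))

def transitive_closure(nodes, edges):
    """Return every pair (u, v) such that v is reachable from u along edges."""
    succ = {u: set() for u in nodes}
    for u, v in edges:
        succ[u].add(v)
    closure = set()
    for src in nodes:
        stack, seen = list(succ[src]), set()
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ[v])
        closure |= {(src, v) for v in seen}
    return closure

nodes = list("abcdef")
prior = {("a", "b"), ("d", "e")}   # assumed edges from earlier rounds
query = ["b", "c", "d"]            # assumed oracle answer: b beats c beats d

new_edges = tournament_edges(query)
assert len(new_edges) == comb(len(query), 2)   # one query, C(3, 2) = 3 edges

known = transitive_closure(nodes, prior | new_edges)
inferred = known - prior - new_edges
print(len(new_edges), "direct edges,", len(inferred), "inferred edges")
```

In this toy instance a single 3-wise query contributes 3 direct edges, and closure adds 5 more preferences (for example, `a` over `e`) without any further oracle call.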

Paper Structure

This paper contains 64 sections, 33 theorems, 48 equations, 10 figures, 6 tables, 3 algorithms.

Key Result

Lemma 2

Let $G^{*}$ be a transitive tournament. Then for all $v\in V$, the discovered in-reach $L(v)$ is a lower bound on the true in-degree of $v$ in $G^{*}$.

Figures (10)

  • Figure 1: A $k$-wise oracle query on $n{=}6$ candidates. Left: A query set $S$ of $k{=}3$ candidates (shaded) is selected. Right: The oracle returns a tournament on $S$, revealing $\binom{3}{2}{=}3$ new edges (blue). Combined with prior edges (gray), additional preferences are inferred transitively (orange dashed).
  • Figure 2: Illustration of BlitzRank achieving the optimal 7 rounds on the classic 25 horses puzzle, where $(n,k,m)=(25,5,3)$. Each node shows the horse ID with the in-reach $L(u)$ at bottom left and the out-reach $W(u)$ at bottom right. Blue nodes indicate horses queried in that round. In this transitive instance, $K(u)=L(u)+W(u)$ and double circles indicate finalized horses (where $K(u)=24$). Note: For reproducibility, the initial ordering was generated with Python's random.shuffle on $[1,2,\dots,25]$ with seed=42. The first five rounds are grouped as shown.
  • Figure 3: Tournament graphs with $n{=}6$ candidates. Top (transitive): Consistent preferences yield a total ordering; each node has a unique rank determined by $L(u)$, the number of nodes that beat it. Bottom (non-transitive): A cycle $b \succ c \succ d \succ b$ forms a strongly connected component (orange). Nodes in the SCC share the same tier since no consistent ordering exists among them, but the partial order $a \succ \{b,c,d\} \succ e \succ f$ is still recovered.
  • Figure 4: Pareto frontiers showing the accuracy-efficiency trade-off across LLM oracles. BlitzRank consistently occupies the upper-left region, achieving competitive accuracy with 25--40% fewer tokens than methods with comparable structure.
  • Figure 5: Evolution of SCCs on DL19 with GPT-4.1. Solid lines and shaded regions show means and variance across queries, respectively. (a) Both $k$'s begin with 100 singleton SCCs. $k{=}20$ forms cycles earlier (rounds 5--7), ending with $\sim$85 SCCs. $k{=}10$ forms fewer cycles and also later. (b) Average SCC size follows a similar pattern: $k{=}20$ reaches 1.18 average size by round 7, while $k{=}10$ reaches only 1.07 by round 15.
  • ...and 5 more figures
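The tiering described in the Figure 3 caption can be reproduced with a standard strongly-connected-components pass. The sketch below uses Kosaraju's algorithm, which is a swapped-in textbook technique and not necessarily what the paper uses; the helper name `scc_tiers` is hypothetical, and the edge list is built by hand to match the non-transitive example $a \succ \{b,c,d\} \succ e \succ f$ with the cycle $b \succ c \succ d \succ b$.

```python
def scc_tiers(nodes, edges):
    """Group nodes into SCCs, returned best tier first (Kosaraju's algorithm)."""
    succ = {u: [] for u in nodes}
    pred = {u: [] for u in nodes}
    for u, v in edges:              # edge (u, v) means u beats v
        succ[u].append(v)
        pred[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    order, seen = [], set()
    def visit(u):
        seen.add(u)
        for v in succ[u]:
            if v not in seen:
                visit(v)
        order.append(u)
    for u in nodes:
        if u not in seen:
            visit(u)

    # Pass 2: DFS on the reversed graph, latest-finished node first.
    tiers, assigned = [], set()
    def collect(u, tier):
        assigned.add(u)
        tier.append(u)
        for v in pred[u]:
            if v not in assigned:
                collect(v, tier)
    for u in reversed(order):
        if u not in assigned:
            tier = []
            collect(u, tier)
            tiers.append(sorted(tier))
    return tiers

nodes = list("abcdef")
edges = (
    [("a", x) for x in "bcdef"]                 # a beats everyone
    + [("b", "c"), ("c", "d"), ("d", "b")]      # the cycle b > c > d > b
    + [(x, y) for x in "bcd" for y in "ef"]     # cycle members beat e and f
    + [("e", "f")]
)
print(scc_tiers(nodes, edges))
```

The cycle members land in one shared tier while `a`, `e`, and `f` each keep a tier of their own, which matches the partial order in the caption.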

Theorems & Definitions (81)

  • Remark 1
  • Lemma 2: Discovered In-Reach Lower Bounds True In-Degree
  • proof
  • Lemma 3: Discovered Ranks Underestimate True Ranks
  • proof
  • Corollary 4: Elimination Criterion
  • proof
  • Proposition 5: Top-$j$ Finalization Criterion
  • proof
  • Definition 6: Finalization Threshold $m_{t}$, Finalized Set $\text{TOP}_{t}$, and Candidate Set $C_{t}$
  • ...and 71 more
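The idea behind Lemma 2, together with the finalization check $K(u)=L(u)+W(u)=n-1$ from the Figure 2 caption, can be tried on a toy transitive instance. The sketch below is an illustration under stated assumptions and is not the paper's Proposition 5: the ground-truth order, the three queries, and the helper names `oracle` and `reach_counts` are all hypothetical.

```python
from itertools import combinations

true_order = list("abcdef")   # assumed ground truth, best first
true_in_degree = {u: i for i, u in enumerate(true_order)}

def oracle(query):
    """Hypothetical noiseless oracle: rank the queried items by the true order."""
    return sorted(query, key=true_order.index)

def reach_counts(nodes, edges):
    """L[u]: nodes known to beat u; W[u]: nodes u is known to beat."""
    succ = {u: set() for u in nodes}
    for u, v in edges:
        succ[u].add(v)
    L = {u: 0 for u in nodes}
    W = {u: 0 for u in nodes}
    for src in nodes:
        stack, seen = list(succ[src]), set()
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ[v])
        W[src] = len(seen)
        for v in seen:
            L[v] += 1
    return L, W

edges = set()
for query in [("a", "b", "c"), ("c", "d", "e"), ("c", "d", "f")]:
    edges |= set(combinations(oracle(query), 2))

L, W = reach_counts(true_order, edges)
n = len(true_order)

# Discovered in-reach never exceeds the true in-degree.
assert all(L[u] <= true_in_degree[u] for u in true_order)

# A node is finalized once its relation to all other n - 1 nodes is known.
finalized = [u for u in true_order if L[u] + W[u] == n - 1]
print(finalized)
```

Here `e` and `f` were never compared with each other, so both stay unfinalized, and $L(f)=4$ sits strictly below the true in-degree of 5.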