Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools

Baris Arat, Emre Sefer

TL;DR

We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools. The diagnostic is model-agnostic and applies to any ranker, including open-source systems and proprietary APIs.

Abstract

Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever. This setup couples ranking behavior with retrieval quality, so differences in output cannot be attributed to the ranking policy alone. We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools. We limit each pool to exactly eight documents and pass identical inputs to all rankers. Within this setup, BM25 and MMR serve as interpretable reference points for lexical matching and diversity optimization, respectively. Across 345 clusters, we find that redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another increases redundancy. In contrast, LLMs underperform on lexical coverage at small selection budgets. As a result, LLM rankings diverge substantially from both baselines rather than consistently approximating either strategy. Because the setup eliminates retrieval variance, we can attribute these differences directly to the ranking policy. This diagnostic is model-agnostic and applicable to any ranker, including open-source systems and proprietary APIs.
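
To make the MMR reference point concrete, the following is a minimal sketch of greedy Maximal Marginal Relevance selection over a fixed pool. It is an illustration under stated assumptions, not the paper's implementation: the token-level Jaccard similarity, the `query` string, the trade-off weight `lam`, and the function names `tokens`, `jaccard`, and `mmr_select` are all hypothetical choices made for this example.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(query: str, pool: List[str], k: int, lam: float = 0.5) -> List[int]:
    """Greedily pick k document indices from a fixed pool.

    Each step maximizes lam * relevance - (1 - lam) * redundancy, where
    redundancy is the highest similarity to any already selected document.
    """
    q = tokens(query)
    docs = [tokens(d) for d in pool]
    selected: List[int] = []
    remaining = list(range(len(pool)))

    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = jaccard(q, docs[i])
            redundancy = max((jaccard(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy pool: documents 0 and 1 are near-duplicates.
pool = [
    "storm hits coast flooding homes",
    "storm hits coast flooding homes today",
    "officials plan relief funding for residents",
]
query = "storm coast flooding relief"
balanced = mmr_select(query, pool, k=2, lam=0.5)        # penalizes the near-duplicate
relevance_only = mmr_select(query, pool, k=2, lam=1.0)  # ignores redundancy
```

With `lam=1.0` the selection reduces to pure relevance ranking, which is why a relevance-only baseline and MMR can bracket an LLM's behavior on the same pool.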

Paper Structure (8 sections, 6 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Experimental setup with fixed evidence pools.
  • Figure 2: Mean coverage and redundancy at $K=3$ with 95% CIs.
  • Figure 3: Redundancy and coverage deltas (MMR minus LLM) for lexical (top) and semantic (bottom) metrics.
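
The deltas named in the Figure 3 caption can be illustrated with a short sketch. The metric definitions here are assumptions for illustration only, since this page does not state the paper's formulas: redundancy is taken as the mean pairwise Jaccard similarity among selected documents, and coverage as the fraction of the pool's vocabulary that appears in the selection. Only a lexical variant is shown, and the names `redundancy`, `coverage`, and `deltas` are hypothetical.

```python
import re
from itertools import combinations
from typing import Dict, List, Sequence, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def redundancy(pool: Sequence[str], selected: Sequence[int]) -> float:
    """Assumed metric: mean pairwise Jaccard similarity among selected docs."""
    sets = [tokens(pool[i]) for i in selected]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    total = sum(len(a & b) / len(a | b) for a, b in pairs if a | b)
    return total / len(pairs)


def coverage(pool: Sequence[str], selected: Sequence[int]) -> float:
    """Assumed metric: share of the pool vocabulary present in the selection."""
    vocab: Set[str] = set().union(*(tokens(d) for d in pool))
    if not vocab:
        return 0.0
    covered: Set[str] = set().union(*(tokens(pool[i]) for i in selected))
    return len(covered) / len(vocab)


def deltas(pool: Sequence[str], mmr_sel: Sequence[int], llm_sel: Sequence[int]) -> Dict[str, float]:
    """Per-cluster deltas, computed as MMR minus LLM as in the caption."""
    return {
        "redundancy": redundancy(pool, mmr_sel) - redundancy(pool, llm_sel),
        "coverage": coverage(pool, mmr_sel) - coverage(pool, llm_sel),
    }


# Toy cluster: the hypothetical LLM selection picks two overlapping documents.
pool: List[str] = ["a b c", "a b c d", "e f"]
result = deltas(pool, mmr_sel=[0, 2], llm_sel=[0, 1])
```

Under these assumed definitions, a negative redundancy delta means the LLM selection is more redundant than MMR's, and a positive coverage delta means MMR covers more of the pool vocabulary.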