Rethinking Reasoning in Document Ranking: Why Chain-of-Thought Falls Short

Xuan Lu, Haohang Huang, Rui Meng, Yaohui Jin, Wenjun Zeng, Xiaoyu Shen

TL;DR

This paper interrogates whether explicit chain-of-thought reasoning improves document reranking. Through a comprehensive, apples-to-apples comparison of pointwise and listwise rerankers under supervised fine-tuning and GRPO reinforcement learning, the authors evaluate on BRIGHT and BEIR benchmarks using MS MARCO as the training corpus. Across all configurations, reasoning-augmented rerankers consistently underperform direct-output rerankers and incur higher inference costs, with calibration breakdown and increased variance identified as core failure modes. The findings challenge the assumption that explicit reasoning universally benefits reranking and point to calibration-aware scoring and more concise reasoning strategies as promising directions for future work.

Abstract

Document reranking is a key component in information retrieval (IR), aimed at refining initial retrieval results to improve ranking quality for downstream tasks. Recent studies, motivated by large reasoning models (LRMs), have begun incorporating explicit chain-of-thought (CoT) reasoning into LLM-based rerankers. However, the effectiveness of such reasoning for ranking tasks remains underexplored. In this work, we present the first systematic study of reasoning in reranking across both pointwise and listwise settings, under both supervised fine-tuning and reinforcement learning. Using diverse benchmarks, including reasoning-intensive datasets (BRIGHT) and standard IR benchmarks (BEIR), we find that reasoning-augmented rerankers consistently underperform their direct counterparts that predict rankings without CoT, despite substantially higher inference costs. Our analysis reveals three core limitations: (i) in pointwise rerankers, reasoning breaks calibration and biases models toward the positive class, raising TPR but lowering TNR, which inflates false positives and degrades ranking in negative-dominant pools; (ii) in listwise rerankers, reasoning improves in-domain fit but increases variance and fails to generalize out-of-domain, even when reinforcement learning shortens rationales; and (iii) overall, directly fine-tuned rerankers remain more stable, effective, and robust. These findings challenge the assumption that explicit reasoning is universally beneficial for reranking. We conclude by highlighting future directions, including calibration-aware scoring for pointwise rerankers and the design of concise, targeted reasoning strategies to mitigate overfitting and overthinking in listwise rerankers.
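
To make the quantities in limitation (i) concrete, the following is a minimal sketch of how the true positive rate (TPR) and true negative rate (TNR) could be computed from pointwise relevance scores. The function name `tpr_tnr`, the 0.5 decision threshold, and the toy scores are illustrative assumptions, not values or code from the paper.

```python
def tpr_tnr(scores, labels, threshold=0.5):
    """Return (TPR, TNR, false_positive_count) for binary relevance labels.

    scores: predicted relevance probabilities in [0, 1]
    labels: 1 for relevant, 0 for non-relevant
    threshold: assumed decision boundary (hypothetical choice)
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_relevant = score >= threshold
        if label == 1:
            if predicted_relevant:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_relevant:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr, fp


# Toy negative-dominant pool: 1 relevant document, 9 non-relevant ones.
labels = [1] + [0] * 9

# Hypothetical scores from a reranker that keeps negatives low.
scores_low_bias = [0.9, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.1]

# Hypothetical scores from a reranker biased toward the positive class.
scores_pos_bias = [0.95, 0.6, 0.7, 0.2, 0.8, 0.3, 0.55, 0.4, 0.9, 0.1]

print(tpr_tnr(scores_low_bias, labels))
print(tpr_tnr(scores_pos_bias, labels))
```

In this toy pool both score sets reach the same TPR, but the positively biased one has a much lower TNR. Because negatives dominate the pool, even a moderate drop in TNR yields several false positives competing with the single relevant document, which is the mechanism the abstract describes.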

Paper Structure

This paper contains 37 sections, 9 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Illustration of Pointwise and Listwise Reranking (Direct vs. Reasoning). In pointwise, each query-document pair is judged independently, with relevance scores computed as the normalized probability of the TRUE token over $\{\texttt{TRUE}, \texttt{FALSE}\}$ logits. Listwise directly optimizes the ranking order over candidate sets, with or without explicit reasoning.
  • Figure 2: Calibration curves of pointwise rerankers: predicted probabilities vs. empirical accuracies.
  • Figure 3: Training-split listwise performance of four 8B variants. Reasoning improves mean NDCG@10 but increases variance.
  • Figure 4: Prompt template for pointwise relevance judgement.
  • Figure 5: Prompt template for pointwise relevance judgement (non-reasoning version). The <think> tag is kept empty to maintain consistency with reasoning prompts.
  • ...and 5 more figures
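
The pointwise scoring rule described in the Figure 1 caption (the probability of the TRUE token, normalized over the TRUE and FALSE logits) can be sketched as a two-way softmax. This is a minimal illustration under stated assumptions: the function names `pointwise_score` and `rerank`, the input layout, and the example logits are hypothetical, and the model call that would produce the logits is omitted.

```python
import math


def pointwise_score(logit_true, logit_false):
    """Normalized probability of TRUE over the {TRUE, FALSE} logits.

    Equivalent to a softmax restricted to two tokens; the max is
    subtracted first for numerical stability.
    """
    m = max(logit_true, logit_false)
    exp_true = math.exp(logit_true - m)
    exp_false = math.exp(logit_false - m)
    return exp_true / (exp_true + exp_false)


def rerank(candidates):
    """Sort candidates by descending pointwise score.

    candidates: list of (doc_id, logit_true, logit_false) tuples,
    where the logits are assumed to come from the reranker's
    next-token distribution for one query-document pair.
    """
    scored = [
        (doc_id, pointwise_score(logit_true, logit_false))
        for doc_id, logit_true, logit_false in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical logits for three documents retrieved for one query.
candidates = [
    ("doc_a", 0.5, 1.5),
    ("doc_b", 2.0, 0.0),
    ("doc_c", 1.0, 1.0),
]
for doc_id, score in rerank(candidates):
    print(doc_id, round(score, 4))
```

Because each pair is scored independently, the final ordering depends entirely on how well these probabilities are calibrated across documents, which is why the calibration curves in Figure 2 matter for pointwise rerankers.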