Rethinking Reasoning in Document Ranking: Why Chain-of-Thought Falls Short
Xuan Lu, Haohang Huang, Rui Meng, Yaohui Jin, Wenjun Zeng, Xiaoyu Shen
TL;DR
This paper examines whether explicit chain-of-thought reasoning improves document reranking. The authors compare pointwise and listwise rerankers on equal footing under supervised fine-tuning and GRPO reinforcement learning, training on MS MARCO and evaluating on the BRIGHT and BEIR benchmarks. Across all configurations, reasoning-augmented rerankers consistently underperform direct-output rerankers and incur higher inference costs, with calibration breakdown and increased variance identified as core failure modes. The findings challenge the assumption that explicit reasoning universally benefits reranking and point to calibration-aware scoring and more concise reasoning strategies as promising directions for future work.
Abstract
Document reranking is a key component in information retrieval (IR), aimed at refining initial retrieval results to improve ranking quality for downstream tasks. Recent studies, motivated by large reasoning models (LRMs), have begun incorporating explicit chain-of-thought (CoT) reasoning into LLM-based rerankers. However, the effectiveness of such reasoning for ranking tasks remains underexplored. In this work, we present the first systematic study of reasoning in reranking across both pointwise and listwise settings, under both supervised fine-tuning and reinforcement learning. Using diverse benchmarks, including reasoning-intensive datasets (BRIGHT) and standard IR benchmarks (BEIR), we find that reasoning-augmented rerankers consistently underperform their direct counterparts that predict rankings without CoT, despite substantially higher inference costs. Our analysis reveals three core limitations: (i) in pointwise rerankers, reasoning breaks calibration and biases models toward the positive class, raising TPR but lowering TNR, which inflates false positives and degrades ranking in negative-dominant pools; (ii) in listwise rerankers, reasoning improves in-domain fit but increases variance and fails to generalize out-of-domain, even when reinforcement learning shortens rationales; and (iii) overall, directly fine-tuned rerankers remain more stable, effective, and robust. These findings challenge the assumption that explicit reasoning is universally beneficial for reranking. We conclude by highlighting future directions, including calibration-aware scoring for pointwise rerankers and the design of concise, targeted reasoning strategies to mitigate overfitting and overthinking in listwise rerankers.
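To make limitation (i) concrete, the following minimal sketch shows how a shift toward the positive class can hurt in a negative-dominant pool. The pool size and the TPR/TNR values are made-up illustrative assumptions, not results from the paper, and `predicted_positive_precision` is a hypothetical helper introduced only for this example.

```python
def predicted_positive_precision(n_pos: int, n_neg: int, tpr: float, tnr: float) -> float:
    """Expected fraction of truly relevant documents among those labeled relevant."""
    true_pos = tpr * n_pos            # relevant documents correctly accepted
    false_pos = (1.0 - tnr) * n_neg   # irrelevant documents wrongly accepted
    return true_pos / (true_pos + false_pos)


# Hypothetical candidate pool: 5 relevant vs. 95 irrelevant documents.
N_POS, N_NEG = 5, 95

# Illustrative (assumed) operating points, NOT numbers reported in the paper.
direct = predicted_positive_precision(N_POS, N_NEG, tpr=0.80, tnr=0.95)
reasoning = predicted_positive_precision(N_POS, N_NEG, tpr=0.90, tnr=0.80)

print(f"direct:    {direct:.3f}")     # about 0.457
print(f"reasoning: {reasoning:.3f}")  # about 0.191
```

Under these assumed numbers, the higher TPR adds only half a true positive in expectation, while the lower TNR adds roughly fourteen false positives, so the documents labeled relevant are much more diluted with irrelevant ones.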
