Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval

Junwei Lan, Jianlyu Chen, Zheng Liu, Chaofan Li, Siqi Bao, Defu Lian

TL;DR

Retro* tackles reasoning-intensive document retrieval for RAG by introducing rubric-based relevance scoring and test-time score integration, enabling interpretable absolute relevance estimates. A two-stage training pipeline (SFT warm-up and RL with composite intra- and inter-document rewards via GRPO) optimizes both per-document scoring and group ranking. Evaluations on BRIGHT show state-of-the-art $nDCG@10$ across 7B and 32B backbones, with further gains from test-time sampling and robust cross-retriever performance; BEIR results demonstrate strong generalization to traditional IR tasks. The approach achieves scalable parallelism and efficiency, providing a practical, reasoning-enabled retrieval framework for real-world RAG systems.

Abstract

With the growing popularity of LLM agents and RAG, it has become increasingly important to retrieve documents that are essential for solving a task, even when their connection to the task is indirect or implicit. Addressing this problem requires fine-grained reasoning to accurately assess the relevance between the task and each candidate document. This capability, however, poses a significant challenge for existing IR techniques. Despite recent progress in reasoning-enhanced IR, existing approaches remain limited in applicability, scalability, and efficiency. In this work, we propose Retro*, a novel approach for reasoning-intensive document retrieval. Our method introduces a rubric-based relevance scoring mechanism, enabling the model to reason about the relationship between a task and a document based on explicitly defined criteria, thereby producing a fine-grained, interpretable relevance score. Retro* also supports test-time scaling by combining multiple reasoning trajectories via score integration, which produces more reliable relevance estimates. To optimize Retro*'s reasoning capabilities, we introduce a novel reinforcement learning algorithm tailored for its relevance scoring mechanism, which employs two composite rewards to fully exploit the trajectories of each training sample. Our experiments show that Retro* outperforms existing document retrieval methods with notable advantages, leading to state-of-the-art performance on the BRIGHT benchmark.
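The idea of test-time score integration can be illustrated with a minimal sketch. The abstract does not specify the integration rule, so the arithmetic mean used here is an assumption, as are the function names `integrate_scores` and `rank_documents` and the example scores, which are purely hypothetical. Each document is scored independently from several sampled reasoning trajectories, the scores are combined into one estimate, and documents are ranked by that estimate.

```python
from statistics import mean


def integrate_scores(sampled_scores):
    """Combine relevance scores from several sampled trajectories.

    Assumption: a plain arithmetic mean; the paper's actual
    integration rule may differ.
    """
    if not sampled_scores:
        raise ValueError("need at least one sampled score")
    return mean(sampled_scores)


def rank_documents(scores_by_doc):
    """Rank document ids by integrated score, highest first."""
    integrated = {
        doc_id: integrate_scores(scores)
        for doc_id, scores in scores_by_doc.items()
    }
    return sorted(integrated, key=integrated.get, reverse=True)


# Hypothetical per-document scores from three sampled trajectories each.
samples = {
    "doc_a": [80, 60, 70],
    "doc_b": [40, 90, 50],
    "doc_c": [20, 30, 10],
}

ranking = rank_documents(samples)
print(ranking)
```

Because each document is scored on its own, the per-document calls are independent and could in principle run in parallel; a single outlier trajectory (such as the 90 for `doc_b`) is damped by averaging rather than deciding the ranking.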

Paper Structure

This paper contains 28 sections, 2 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Relevance rubric for Retro*. The Relevance Placeholder allows users to specify the definition of relevance, while a 5-level set of criteria ensures consistent and interpretable scoring results.
  • Figure 2: An overview of the two-stage training. SFT: the model is warmed up with filtered data from a powerful teacher model. RL: the model is reinforced with the tailored composite reward.
  • Figure 3: (Left): average performance (nDCG@10) on the BRIGHT benchmark. Retro*'s re-ranking performance consistently improves with increased model scale and test-time samples. (Right): inference time on the TheoT. dataset from the BRIGHT benchmark. Retro* exhibits significantly lower latency than other methods as the number of candidate documents increases.
  • Figure 4: Score distributions from pointwise models on a sample of positive and negative documents on the BRIGHT benchmark. The intensity of the color represents the density of scores, with darker hues indicating that a larger proportion of documents is assigned scores in that range.
  • Figure 5: Training dynamics of Retro* (7B) during the RL stage. (Left): The training reward steadily increases over training steps. (Right): along with the improved reward, the retrieval accuracy (nDCG@10) improves consistently with the training steps.
  • ...and 3 more figures