Less is More: Compact Clue Selection for Efficient Retrieval-Augmented Generation Reasoning

Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Yongxin Tong, Zhiming Zheng

TL;DR

This work tackles the inefficiency of human-optimized RAG retrievers for LLM reasoning by proposing CompSelect, a compact clue selection framework. It casts retrieval as a MinMax problem and decomposes the solution into three modules: a KNN-based clue extractor that gathers answer-containing and similar sentences, a pairwise-trained clue reranker that orders clues by usefulness, and an adaptive truncator that retains the minimal sufficient context for correct answers. Empirical results on NQ, TriviaQA, and HotpotQA across LLaMA3-8B-Instruct and Qwen3-14B show that CompSelect improves SubEM and F1 while achieving substantial reductions in both total and online latency, and it demonstrates robustness to unreliable retrieval and cross-task generalization. The approach delivers a scalable, cost-efficient RAG solution that better matches the information needs of LLM-based reasoning, with promising avenues for integrating generative retrieval to further reduce end-to-end latency.
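The two answer-quality metrics named above can be sketched as follows. The summary does not define them, so this assumes the common open-domain QA conventions: SubEM checks whether any normalized gold answer appears as a substring of the normalized prediction, and F1 is token-level overlap. The helper names (`normalize`, `sub_em`, `token_f1`) and the SQuAD-style normalization are illustrative assumptions, not the authors' evaluation code.

```python
import re
import string
from collections import Counter
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sub_em(prediction: str, gold_answers: Iterable[str]) -> bool:
    """Substring exact match: True if any normalized gold answer occurs inside the prediction."""
    pred = normalize(prediction)
    return any(normalize(gold) in pred for gold in gold_answers)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a verbose but correct generation still scores on SubEM, while F1 penalizes the extra tokens.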

Abstract

Current RAG retrievers are designed primarily for human readers, emphasizing complete, readable, and coherent paragraphs. However, LLMs benefit more from precise, compact, and well-structured input, which enhances reasoning quality and efficiency. Existing methods often rely on reranking or summarization to identify key sentences, but may suffer from semantic breaks and unfaithfulness. Thus, efficiently extracting and organizing answer-relevant clues from large-scale documents while reducing LLM reasoning costs remains a challenge for RAG. Inspired by Occam's razor, we frame LLM-centric retrieval as a MinMax optimization: maximizing the extraction of potential clues and reranking them into a well-organized order, while minimizing reasoning costs by truncating to the smallest sufficient clue set. In this paper, we propose CompSelect, a Compact clue Selection mechanism for LLM-centric RAG, consisting of a clue extractor, a reranker, and a truncator. (1) The clue extractor uses answer-containing sentences as fine-tuning targets, aiming to extract sufficient potential clues; (2) the reranker is trained to prioritize effective clues based on real LLM feedback; (3) the truncator uses, as fine-tuning targets, the truncated text containing the minimum sufficient clues for answering the question, thereby enabling efficient RAG reasoning. Experiments on three QA datasets show that CompSelect improves QA performance by approximately 11% and reduces Total Latency and Online Latency by approximately 17% and 67%, respectively, compared to various baseline methods on both LLaMA3 and Qwen3. Further analysis confirms its robustness to unreliable retrieval and generalization across different scenarios, offering a scalable and cost-efficient solution for web-scale RAG applications.
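The extract, rerank, truncate flow described above can be illustrated with a toy sketch. This is not the authors' implementation: in CompSelect all three modules are fine-tuned models, whereas here each stage is a plain function, and the scoring and sufficiency checks are hypothetical callables supplied by the caller. The extractor also uses the gold answer, so it is closer to how annotated targets might be built than to inference. All names (`extract_clues`, `rerank_clues`, `truncate_clues`, `threshold`) and the example data are assumptions for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def extract_clues(sentences: List[str], vectors: List[Vector],
                  answer: str, threshold: float) -> List[str]:
    """Keep answer-containing sentences plus neighbours within `threshold` distance of one.

    A larger threshold admits more contextual sentences.
    """
    seeds = [i for i, s in enumerate(sentences) if answer.lower() in s.lower()]
    keep = set(seeds)
    for i, vec in enumerate(vectors):
        if any(cosine_distance(vec, vectors[j]) <= threshold for j in seeds):
            keep.add(i)
    return [sentences[i] for i in sorted(keep)]


def rerank_clues(clues: List[str], score: Callable[[str], float]) -> List[str]:
    """Order clues from most to least useful according to a caller-supplied score."""
    return sorted(clues, key=score, reverse=True)


def truncate_clues(ranked: List[str],
                   is_sufficient: Callable[[List[str]], bool]) -> List[str]:
    """Return the shortest prefix of the ranked clues judged sufficient, else all of them."""
    for k in range(1, len(ranked) + 1):
        if is_sufficient(ranked[:k]):
            return ranked[:k]
    return list(ranked)


# Toy data: hand-written 2-d vectors stand in for sentence embeddings.
sentences = [
    "Paris is the capital of France.",
    "France is in Western Europe.",
    "Bananas are yellow.",
]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]

clues = extract_clues(sentences, vectors, answer="Paris", threshold=0.1)
ranked = rerank_clues(clues, score=lambda s: 1.0 if "capital" in s else 0.0)
compact = truncate_clues(ranked, is_sufficient=lambda ctx: any("Paris" in s for s in ctx))
```

In this toy run the extractor keeps the answer sentence and its close neighbour, and the truncator cuts the ranked list back to the single sentence that already supports the answer, which is the "smallest sufficient clue set" idea in miniature.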

Paper Structure

This paper contains 40 sections, 9 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: SubEM performance and latency on the NQ test set for different sizes of the LLaMA3 series, including the reranking method RECOMP-extr and the abstractive method Refiner.
  • Figure 2: An illustration of the challenge in locating accurate answer clues. While baselines RECOMP and RichRAG select an incorrect clue from the first document, our method identifies the correct clue from the fourth via extraction, reranking, and truncation.
  • Figure 3: The architecture of CompSelect consists of three modules: a clue extractor, a reranker, and a truncator. The top shows their training strategies and annotated data, and the bottom illustrates the inference process of compact clue selection.
  • Figure 4: An illustration of clue extraction performance using LLaMA3-8B-Instruct as generator on NQ, TriviaQA, and HotpotQA test sets. The x-axis shows the KNN threshold and higher values introduce more contextual sentences.
  • Figure 5: Latency comparison across baselines and our method. The top shows Total Latency along with SubEM performance, while the bottom shows Online Latency. Experiments were conducted on two NVIDIA RTX PRO 6000 GPUs.
  • ...and 2 more figures