Test-Time Strategies for More Efficient and Accurate Agentic RAG

Brian Zhang, Deepti Guntur, Zhiyang Zuo, Abhinav Sharma, Shreyas Chaudhari, Wenlong Zhao, Franck Dernoncourt, Puneet Mathur, Ryan Rossi, Nedim Lipka

Abstract

Retrieval-Augmented Generation (RAG) systems face challenges with complex, multi-hop questions. Agentic frameworks such as Search-R1 (Jin et al., 2025), which operates iteratively, have been proposed to address these complexities. However, such approaches can introduce inefficiencies, including repeated retrieval of previously processed information and difficulty contextualizing retrieved results effectively within the current generation prompt. These issues can lead to unnecessary retrieval turns, suboptimal reasoning, inaccurate answers, and increased token consumption. In this paper, we investigate test-time modifications to the Search-R1 pipeline to mitigate these shortcomings. Specifically, we explore the integration of two components and their combination: a contextualization module to better integrate relevant information from retrieved documents into reasoning, and a de-duplication module that replaces previously retrieved documents with the next most relevant ones. We evaluate our approaches on the HotpotQA (Yang et al., 2018) and Natural Questions (Kwiatkowski et al., 2019) datasets, reporting the exact match (EM) score, an LLM-as-a-Judge assessment of answer correctness, and the average number of turns. Our best-performing variant, which uses GPT-4.1-mini for contextualization, achieves a 5.6% increase in EM score and reduces the number of turns by 10.5% compared to the Search-R1 baseline, demonstrating improved answer accuracy and retrieval efficiency.
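
The de-duplication idea described in the abstract, replacing previously retrieved documents with the next most relevant ones, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes the retriever returns an over-fetched list ranked by relevance and that each document carries a stable identifier. The names `Doc`, `dedup_top_k`, `run_turn`, `retrieve`, and `contextualize` are hypothetical, and applying de-duplication before contextualization is an assumed ordering that the abstract does not specify.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical document representation: (doc_id, text).
Doc = Tuple[str, str]


def dedup_top_k(ranked: List[Doc], seen: Set[str], k: int) -> List[Doc]:
    """Return up to k highest-ranked documents whose ids are not in `seen`.

    Documents seen in earlier turns are skipped, so lower-ranked but novel
    documents take their place. `seen` is updated in place.
    """
    fresh: List[Doc] = []
    for doc_id, text in ranked:
        if doc_id in seen:
            continue  # already shown to the LLM in a previous turn
        fresh.append((doc_id, text))
        if len(fresh) == k:
            break
    seen.update(doc_id for doc_id, _ in fresh)
    return fresh


def run_turn(
    query: str,
    retrieve: Callable[[str], List[Doc]],
    seen: Set[str],
    k: int = 3,
    contextualize: Optional[Callable[[str, List[Doc]], str]] = None,
) -> str:
    """One retrieval turn: retrieve, de-duplicate, optionally contextualize.

    `retrieve` is assumed to return more than k documents, ranked by
    relevance. `contextualize` stands in for an LLM call that rewrites the
    documents for the current query; here it is any caller-supplied function.
    """
    docs = dedup_top_k(retrieve(query), seen, k)
    if contextualize is not None:
        return contextualize(query, docs)
    # Without a contextualizer, pass the raw document text through.
    return "\n".join(text for _, text in docs)
```

Calling `run_turn` repeatedly with the same `seen` set means a repeated query yields documents further down the ranking rather than the ones already returned, which is the behavior the de-duplication module is described as providing.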

Paper Structure (32 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: An illustration of the information flow for our proposed test-time strategies compared to the baseline during a single inference turn $i$. Baseline: This represents the standard Search-R1 framework, where the LLM sends a query ($q_i$) to the retriever and directly receives the retrieved documents ($D_i$) to continue its reasoning. Deduplication: This approach filters out previously seen content and returns only a set of novel documents ($D_i'$) to the LLM. Contextualization: This approach parses the retrieved documents ($D_i$) and reformulates their content to improve integration into the LLM's reasoning process, returning an enhanced set of information ($D_i^*$). Hybrid: This approach combines both modules sequentially.
  • Figure 2: Exact Match (EM) score by search count for the Search-R1 baseline and our Contextualization module. EM trends downward for both, illustrating that questions requiring more agentic turns are inherently more difficult. The Contextualization module achieves a slightly higher mean EM at some points, but the overlapping 95% confidence intervals indicate no statistically significant difference between the two approaches at any given search count.
  • Figure 3: An illustration of the information flow for our proposed test-time strategies compared to the baseline during a single inference turn $i$. Baseline: This represents the standard Search-R1 framework, where the LLM sends a query ($q_i$) to the retriever and directly receives the retrieved documents ($D_i$) to continue its reasoning. Deduplication: This approach filters out previously seen content and returns only a set of novel documents ($D_i'$) to the LLM. Contextualization: This approach parses the retrieved documents ($D_i$) and reformulates their content to improve integration into the LLM's reasoning process, returning an enhanced set of information ($D_i^*$). Hybrid: This approach combines both modules sequentially.
  • Figure 4: Exact Match (EM) score by search count for the Search-R1 baseline and our Contextualization module. EM trends downward for both, illustrating that questions requiring more agentic turns are inherently more difficult. The Contextualization module achieves a slightly higher mean EM at some points, but the overlapping 95% confidence intervals indicate no statistically significant difference between the two approaches at any given search count.
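
The per-search-count comparison described in the EM captions above could be computed along the following lines. This is a sketch under stated assumptions rather than the paper's evaluation code: the answer normalization follows the common SQuAD-style convention (lowercasing, stripping punctuation and English articles), and the 95% interval uses a normal approximation for a proportion. The source does not specify either choice, and the function names are hypothetical.

```python
import math
import re
import string
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: Iterable[str]) -> int:
    """1 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return int(any(pred == normalize(g) for g in golds))


def em_by_search_count(
    records: Iterable[Tuple[int, int]],
) -> Dict[int, Tuple[float, float, float, int]]:
    """Group (num_searches, em) pairs and summarize each group.

    Returns {num_searches: (mean_em, ci_low, ci_high, n)}, where the
    interval is an assumed normal-approximation 95% CI clipped to [0, 1].
    """
    buckets: Dict[int, List[int]] = defaultdict(list)
    for n_search, em in records:
        buckets[n_search].append(em)

    summary: Dict[int, Tuple[float, float, float, int]] = {}
    for n_search, ems in sorted(buckets.items()):
        n = len(ems)
        mean = sum(ems) / n
        half_width = 1.96 * math.sqrt(mean * (1.0 - mean) / n)
        summary[n_search] = (
            mean,
            max(0.0, mean - half_width),
            min(1.0, mean + half_width),
            n,
        )
    return summary
```

Running `em_by_search_count` once per system and comparing the intervals at each search count mirrors the comparison in the captions. With small groups the normal approximation is crude, so a Wilson or bootstrap interval may be preferable in practice.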