Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

Peilin Wu, Mian Zhang, Xinlu Zhang, Xinya Du, Zhiyu Zoey Chen

TL;DR

This work formalizes sub-optimal search behaviors in agentic RAG systems as over-search and under-search, showing their prevalence across multi-hop QA tasks. It introduces $\beta$-GRPO, a confidence-aware reinforcement learning method that rewards high-certainty search decisions by tying rewards to the minimum probability among search tokens, improving search reliability. Across seven QA benchmarks, a 3B model with $\beta$-GRPO achieves around a $4\%$ increase in average exact-match accuracy and reduces both over-search and under-search rates (roughly $1.21\%$ and $7.33\%$, respectively) compared with strong baselines, indicating better knowledge-boundary awareness. The results highlight the importance of uncertainty-aware search control for robust agentic RAG and point to scalable directions for future work on larger models and more nuanced rewards.

Abstract

Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors such as over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models' uncertainty regarding their own knowledge boundaries: response accuracy correlates with the model's uncertainty in its search decisions. To address this, we propose $\beta$-GRPO, a reinforcement learning-based training method that incorporates a confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that $\beta$-GRPO gives a 3B model better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.
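
The confidence-thresholded reward described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the confidence of a search call is the minimum probability among that call's tokens (as the TL;DR states), and that a rollout earns a reward only when its final answer is correct and every search call reaches the threshold $\beta$. The function names, the binary 0/1 reward, and the handling of rollouts with no search calls are assumptions made for illustration.

```python
import math
from typing import Sequence


def search_confidence(token_logprobs: Sequence[float]) -> float:
    """Confidence of one search call: the minimum token probability.

    `token_logprobs` holds the log-probabilities the policy assigned to
    the tokens of a single search call (hypothetical input format).
    """
    if not token_logprobs:
        raise ValueError("a search call must contain at least one token")
    return min(math.exp(lp) for lp in token_logprobs)


def confidence_gated_reward(
    answer_correct: bool,
    search_calls_logprobs: Sequence[Sequence[float]],
    beta: float,
) -> float:
    """Binary reward gated by search-decision confidence (assumed form).

    Returns 1.0 only if the answer is correct AND every search call has
    confidence >= beta; otherwise 0.0. A rollout with no search calls
    passes the confidence check trivially (an assumption of this sketch).
    """
    if not answer_correct:
        return 0.0
    all_confident = all(
        search_confidence(call) >= beta for call in search_calls_logprobs
    )
    return 1.0 if all_confident else 0.0


# Usage: one confident search call and one with a low-probability token.
confident_call = [math.log(0.9), math.log(0.8)]
hesitant_call = [math.log(0.9), math.log(0.3)]

reward_confident = confidence_gated_reward(True, [confident_call], beta=0.5)
reward_hesitant = confidence_gated_reward(
    True, [confident_call, hesitant_call], beta=0.5
)
reward_wrong = confidence_gated_reward(False, [confident_call], beta=0.5)
```

Gating on the minimum rather than the mean token probability means a single hesitant token is enough to withhold the reward, which matches the stated goal of rewarding only high-certainty search decisions.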

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Flowchart of the analysis pipeline for over-search and under-search.
  • Figure 2: Percentage of all search steps that could have been answered without searching, for R1-Searcher and Search-R1 on 4 datasets combined, by the number of searches in each test sample.
  • Figure 3: Error rate over all non-search steps of R1-Searcher and Search-R1 on 4 datasets combined, by the number of searches in each test sample.
  • Figure 4: Training rewards for Search-R1-GRPO and Search-R1-$\beta$-GRPO.