HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Kaiyu He, Xinya Du, Zhiyu Chen

TL;DR

HiPRAG tackles inefficiencies in agentic retrieval-augmented generation by introducing a hierarchical, knowledge-grounded process reward that provides step-level feedback on search decisions. It decomposes reasoning into parsable steps, detects suboptimal searches on the fly, and computes a hierarchical reward combining final answer correctness, format adherence, and a process efficiency bonus. Empirical results on Qwen2.5 and Llama-3.2 across seven QA benchmarks show average accuracies of 65.4% (3B) and 67.2% (7B), together with an over-search rate reduced to 2.3% and a lower under-search rate. The work reports good generalization across model families, RL algorithms, and sizes, highlighting the value of fine-grained, process-level supervision for improving both correctness and efficiency in retrieval-augmented reasoning.

Abstract

Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which lead to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in an RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into RL training. Our approach evaluates the necessity of each search decision on the fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG generalizes well across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL for improving the efficiency and optimality of reasoning for search agents.
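
The hierarchical reward described above can be illustrated with a minimal sketch. It is not the paper's exact formula: the `Step` structure, the weights `format_weight` and `process_weight`, and the choice to grant the process bonus only when the answer is correct and well formatted are all assumptions made here for illustration. The only part taken from the abstract is the idea of an outcome reward and a format reward, plus a bonus proportional to the share of optimal search and non-search steps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One parsed reasoning step (hypothetical structure, not from the paper)."""
    is_search: bool   # True if the step issued a retrieval query
    is_optimal: bool  # False if the step was flagged as over-search or under-search


def hierarchical_reward(
    steps: List[Step],
    answer_correct: bool,
    format_ok: bool,
    format_weight: float = 0.2,   # assumed weight, chosen for illustration
    process_weight: float = 0.4,  # assumed weight, chosen for illustration
) -> float:
    """Combine outcome, format, and process rewards into one scalar."""
    # Base layer: the commonly used outcome and format rewards.
    reward = 1.0 if answer_correct else 0.0
    if format_ok:
        reward += format_weight

    # Process layer (assumed gating): add the bonus only when the trajectory
    # is already correct and parsable, so that efficiency is never rewarded
    # at the expense of correctness.
    if answer_correct and format_ok and steps:
        optimal_fraction = sum(s.is_optimal for s in steps) / len(steps)
        reward += process_weight * optimal_fraction

    return reward


# Example trajectory: two searches (one redundant) and two non-search steps.
trajectory = [
    Step(is_search=True, is_optimal=True),
    Step(is_search=True, is_optimal=False),   # over-search
    Step(is_search=False, is_optimal=True),
    Step(is_search=False, is_optimal=True),
]
print(hierarchical_reward(trajectory, answer_correct=True, format_ok=True))
```

Under this sketch, a trajectory with an incorrect final answer receives no process bonus regardless of how efficient its searches were, which is one plausible way to keep the efficiency signal subordinate to correctness.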

Paper Structure

This paper contains 38 sections, 3 equations, 8 figures, 6 tables, 3 algorithms.

Figures (8)

  • Figure 1: A general overview of the HiPRAG training workflow. The policy model generates a multi-step reasoning trajectory, and each step is evaluated on-the-fly to detect suboptimal search behaviors. A final hierarchical reward is then computed by combining a process bonus for step optimality with rewards for the final answer's correctness and proper formatting.
  • Figure 2: Reward curves for different RL algorithms, and curves of the ratio of searches among all reasoning steps for different model families.
  • Figure 3: Comparison of reasoning trajectory formats for the same multi-hop question. Each logical step is highlighted in a consistent color across both formats to show the correspondence. The retrieved documents are replaced here by summaries to improve readability.
  • Figure 4: Input prompt for generating HiPRAG's parsable output format with the new XML tagging system.
  • Figure 5: Prompt for Over-search Detection
  • ...and 3 more figures