Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang

TL;DR

This work tackles reward sparsity in GRPO-based training for LLM-powered search agents by repurposing ground-truth entities from synthetic data as a dense, fine-grained reward signal. The authors formalize entity match rate $\gamma_i$ and its normalized form $\hat{\gamma}_i$ as proxies for reasoning quality and introduce Entity-aware Group Relative Policy Optimization (E-GRPO), which provides a non-binary bonus to negative samples proportional to $\hat{\gamma}_i$ with a balancing factor $\alpha$. Empirical results across 11 QA and deep-research benchmarks show that E-GRPO consistently outperforms GRPO in both Local and Web environments and yields more efficient reasoning with fewer tool calls. Analyses reveal a strong association between entity matching and accuracy, and ablations identify a moderate $\alpha$ (around 0.3) as optimal, highlighting the method’s robustness and sample efficiency. Overall, E-GRPO offers a practical, scalable improvement for aligning search agents in knowledge-intensive tasks without additional annotation or sampling overhead.
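The reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It makes several assumptions: entity matching is done by case-insensitive substring search, $\hat{\gamma}_i$ is obtained by dividing $\gamma_i$ by the largest match rate in the rollout group, and correct rollouts receive a reward of 1. The function names `entity_match_rate` and `entity_aware_rewards`, and the toy entities, are hypothetical.

```python
from typing import List, Sequence


def entity_match_rate(trajectory_text: str, entities: Sequence[str]) -> float:
    """gamma_i: fraction of ground-truth entities mentioned in a rollout.

    Assumption: an entity counts as matched if it appears as a
    case-insensitive substring of the rollout text.
    """
    if not entities:
        return 0.0
    text = trajectory_text.lower()
    matched = sum(1 for entity in entities if entity.lower() in text)
    return matched / len(entities)


def entity_aware_rewards(
    correct: Sequence[bool],
    match_rates: Sequence[float],
    alpha: float = 0.3,
) -> List[float]:
    """Reward 1.0 for correct rollouts, alpha * gamma_hat_i for incorrect ones.

    Assumption: gamma_hat_i = gamma_i / max_j gamma_j over the rollout group.
    """
    max_rate = max(match_rates, default=0.0)
    rewards = []
    for is_correct, rate in zip(correct, match_rates):
        normalized = rate / max_rate if max_rate > 0 else 0.0
        rewards.append(1.0 if is_correct else alpha * normalized)
    return rewards


# Toy group of three rollouts for one question (entities are made up).
entities = ["Captain Verne", "Aldoria", "1872"]
rollouts = [
    "Search results mention Captain Verne, who reached Aldoria in 1872.",
    "Captain Verne sailed to Aldoria, but the year is unclear.",
    "No relevant pages were found.",
]
correct = [True, False, False]

rates = [entity_match_rate(text, entities) for text in rollouts]
rewards = entity_aware_rewards(correct, rates, alpha=0.3)
print(rates)
print(rewards)
```

In this toy group, the second rollout finds two of the three entities and earns a small positive reward, while the third finds none and earns zero, so the two incorrect rollouts are no longer indistinguishable.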

Abstract

LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.

Paper Structure

This paper contains 47 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Left: An example of entity-centric synthetic data generation. Right: Analysis of the correlation between entity matching and agent performance.
  • Figure 2: Comparison of GRPO and E-GRPO. GRPO applies outcome-based reward, while E-GRPO additionally assigns a bonus to negatives proportional to their normalized entity match rate. The three rollouts illustrate a success, a complete failure, and a "near-miss", respectively.
  • Figure 3: Training dynamics of 30B models in the Web environment: E-GRPO versus GRPO on training accuracy and tool call steps, and the relationship between entity matching and training accuracy.
  • Figure 4: Comparison of different entity matching weights.
  • Figure 5: Comparison of normalized entity match rate in thoughts and entire trajectories.
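
The three-rollout scenario in the Figure 2 caption (a success, a complete failure, and a "near-miss") can be illustrated numerically. The sketch below assumes the common GRPO convention of standardizing rewards within a group (subtract the group mean, divide by the group standard deviation). The reward values are hypothetical: the near-miss is given 0.3, which corresponds to a balancing factor of 0.3 and a normalized entity match rate of 1.0. The helper name `group_relative_advantages` is also an assumption.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Standardize rewards within one rollout group (assumed GRPO convention)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rollout order: success, complete failure, near-miss.
rewards_grpo = [1.0, 0.0, 0.0]    # outcome-only reward
rewards_e_grpo = [1.0, 0.0, 0.3]  # hypothetical bonus for the near-miss

adv_grpo = group_relative_advantages(rewards_grpo)
adv_e_grpo = group_relative_advantages(rewards_e_grpo)

print("GRPO advantages:  ", [round(a, 3) for a in adv_grpo])
print("E-GRPO advantages:", [round(a, 3) for a in adv_e_grpo])
```

Under the outcome-only reward, the complete failure and the near-miss receive identical advantages. With the entity-aware bonus, the near-miss is ranked above the complete failure while still remaining below the success.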