Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
TL;DR
This work tackles reward sparsity in GRPO-based training for LLM-powered search agents by repurposing ground-truth entities from synthetic data as a dense, fine-grained reward signal. The authors formalize the entity match rate $\gamma_i$ and its normalized form $\hat{\gamma}_i$ as proxies for reasoning quality and introduce Entity-aware Group Relative Policy Optimization (E-GRPO), which provides a non-binary bonus to negative samples proportional to $\hat{\gamma}_i$ with a balancing factor $\alpha$. Empirical results across 11 QA and deep-research benchmarks show that E-GRPO consistently outperforms GRPO in both Local and Web environments and yields more efficient reasoning with fewer tool calls. Analyses reveal a strong association between entity matching and accuracy, and ablations identify a moderate $\alpha$ (around 0.3) as optimal, highlighting the method’s robustness and sample efficiency. Overall, E-GRPO offers a practical, scalable improvement for aligning search agents in knowledge-intensive tasks without additional annotation or sampling overhead.
Abstract
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
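The entity-aware reward described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: entities are matched by case-insensitive substring search, the match rate is normalized by the largest match rate in the rollout group, a correct answer receives a reward of 1, and advantages use the usual group mean and standard deviation. All function names are hypothetical.

```python
from statistics import mean, pstdev


def entity_match_rate(trajectory_text, entities):
    """Fraction of ground-truth entities mentioned in a trajectory.

    Assumption: case-insensitive substring matching.
    """
    if not entities:
        return 0.0
    text = trajectory_text.lower()
    hits = sum(1 for entity in entities if entity.lower() in text)
    return hits / len(entities)


def entity_aware_rewards(match_rates, correct, alpha=0.3):
    """Reward 1.0 for correct samples, alpha * normalized match rate otherwise.

    Assumption: the match rate is normalized by the group's maximum match rate.
    """
    max_rate = max(match_rates, default=0.0)
    rewards = []
    for rate, is_correct in zip(match_rates, correct):
        if is_correct:
            rewards.append(1.0)
        else:
            normalized = rate / max_rate if max_rate > 0 else 0.0
            rewards.append(alpha * normalized)
    return rewards


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three rollouts for one question: one correct, one near-miss, one failure.
    rates = [1.0, 0.5, 0.0]
    correct = [True, False, False]
    rewards = entity_aware_rewards(rates, correct, alpha=0.3)
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under a purely outcome-based reward, the second and third rollouts would both receive 0 and be indistinguishable. In this sketch the near-miss receives a small positive reward, so its advantage is higher than that of the complete failure while remaining below that of the correct rollout.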
