Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents

Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia

TL;DR

The paper addresses cross-stratum bias that arises when applying policy-gradient RL to structurally heterogeneous LLM search trajectories. It introduces Stratified GRPO, centered on Stratified Advantage Normalization (SAN), which partitions trajectories by structure and computes within-stratum advantages to remove cross-stratum bias, while proving SAN is conditionally unbiased with unit variance and invariant to positive affine reward transforms when $\varepsilon=0$. A blended variant combines SAN with global normalization to maintain stability in finite samples. Empirically, Stratified GRPO yields up to 11.3 points higher performance and better training stability across seven single- and multi-hop QA benchmarks, particularly improving multi-turn search policies. Overall, stratification provides a principled remedy for structural heterogeneity in RL for LLM search agents, enabling more reliable learning of complex search strategies.

Abstract

Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi-step problems, and reinforcement learning (RL) has become a key paradigm for training them. However, the trajectories of search agents are structurally heterogeneous: variations in the number, placement, and outcomes of search calls lead to fundamentally different answer directions and reward distributions. Standard policy gradient methods, which use a single global baseline, suffer from what we identify and formalize as cross-stratum bias: an "apples-to-oranges" comparison of heterogeneous trajectories. This cross-stratum bias distorts credit assignment and hinders exploration of complex, multi-step search strategies. To address this, we propose Stratified GRPO, whose central component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on their structural properties and computes advantages locally within each stratum. This ensures that trajectories are evaluated only against their true peers. Our analysis proves that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates inside each stratum, and retains the global unbiasedness and unit-variance properties enjoyed by standard normalization, resulting in a purer and more scale-stable learning signal. To improve practical stability under finite-sample regimes, we further linearly blend SAN with the global estimator. Extensive experiments on diverse single-hop and multi-hop question-answering benchmarks demonstrate that Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3 points, achieving higher training rewards, greater training stability, and more effective search policies. These results establish stratification as a principled remedy for structural heterogeneity in RL for LLM search agents.
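To make the abstract's description concrete, the following is a minimal sketch of within-stratum normalization and the linear blend with the global estimator. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of search-call count as the stratum key, the placement of `eps` in the denominator, and the blending weight `alpha` (and which estimator it weights) are all hypothetical.

```python
import math
from collections import defaultdict


def global_advantages(rewards, eps=1e-8):
    """Global normalization: z-score every reward against the whole batch."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def san_advantages(rewards, strata, eps=1e-8):
    """Stratified normalization: z-score each reward against its own stratum.

    `strata[i]` is a hashable key describing the structure of trajectory i.
    Using the number of search calls as the key is an assumption made here.
    """
    groups = defaultdict(list)
    for i, key in enumerate(strata):
        groups[key].append(i)

    adv = [0.0] * len(rewards)
    for idx in groups.values():
        local = global_advantages([rewards[i] for i in idx], eps)
        for i, a in zip(idx, local):
            adv[i] = a
    return adv


def blended_advantages(rewards, strata, alpha=0.5, eps=1e-8):
    """Linear blend of the stratified and global estimators.

    Assumed convention: alpha weights the stratified term, (1 - alpha) the
    global term. The source only states that the two are blended linearly.
    """
    san = san_advantages(rewards, strata, eps)
    gn = global_advantages(rewards, eps)
    return [alpha * s + (1 - alpha) * g for s, g in zip(san, gn)]


# Toy batch: two trajectories with 0 search calls, two with 2 search calls.
rewards = [0.0, 0.2, 0.8, 1.0]
strata = [0, 0, 2, 2]

gn = global_advantages(rewards)
san = san_advantages(rewards, strata)
mix = blended_advantages(rewards, strata, alpha=0.5)
```

In this toy batch, the second trajectory is the better of the two zero-search trajectories, yet its reward lies below the batch mean. Global normalization therefore assigns it a negative advantage, while the stratified version assigns it a positive one, which is the "apples-to-oranges" effect the abstract describes. A stratum with a single trajectory receives an advantage of zero in this sketch, which hints at why blending with the global estimator can help in small batches.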

Paper Structure

This paper contains 42 sections, 10 theorems, 62 equations, 1 figure, 3 tables, 1 algorithm.

Key Result

Proposition 1

For any trajectory $\tau_i \in B_k$, the global advantage decomposes as
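One way to write such a decomposition, assuming the unnormalized global advantage is the reward minus the batch mean, is the add-and-subtract identity below. The symbols $R(\tau_i)$, $\bar{R}_k$ (mean reward over stratum $B_k$), and $\bar{R}_{\mathrm{global}}$ (mean reward over the whole batch) are assumed notation and may differ from the paper's.

```latex
\hat{A}_G(\tau_i)
  = R(\tau_i) - \bar{R}_{\mathrm{global}}
  = \underbrace{R(\tau_i) - \bar{R}_k}_{\text{within-stratum advantage}}
  + \underbrace{\bar{R}_k - \bar{R}_{\mathrm{global}}}_{\text{cross-stratum offset}}
```

Under this reading, the second term is shared by every trajectory in $B_k$ and depends only on which stratum the trajectory belongs to, not on how it compares with its structural peers.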

Figures (1)

  • Figure 1: Training dynamics of Stratified GRPO and GRPO. The left plots show training rewards, and the right plots show the number of search calls per question over training steps.

Theorems & Definitions (23)

  • Proposition 1: Advantage Decomposition
  • Theorem 1: Variance Reduction via Stratified Baselines
  • Definition 1
  • Proposition 2: Invariance to Positive Affine Reward Transforms
  • Theorem 2: Variance Decomposition for Normalized Stratified Advantage
  • Theorem 3: Population SAN Expectation
  • Proposition 3: Exact Advantage Decomposition
  • Theorem 4: Conditional Properties of SAN and GN Advantages
  • Theorem 5: Global Moments of SAN and GN
  • Definition 2: Blended Advantage
  • ...and 13 more