On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li

TL;DR

The paper identifies Lazy Likelihood Displacement (LLD) as the central instability in GRPO-based tool-integrated RL for LLMs: the likelihood of correct responses decays even as rewards improve, triggering a self-reinforcing spiral of instability. It introduces LLDS, a targeted likelihood-preserving regularizer with trajectory-level and token-level gating (plus an LLDS-MA variant), to stabilize training and prevent gradient explosion. Across seven open-domain and multi-hop QA benchmarks, LLDS and its MA variant yield substantial performance gains (up to +37.8% on Qwen2.5-3B and +32.0% on Qwen2.5-7B) along with stable training. The work provides both a mechanistic account of LLD in tool-based GRPO and a practical, scalable fix for training agentic LLMs that use tools over multiple turns.

Abstract

Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet it consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, in which declining likelihood leads to low-confidence responses, inflated gradients, and ultimately collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose LLDS, a lightweight likelihood-preserving regularizer for GRPO that activates only when a trajectory's likelihood decreases and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLMs.
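
The two gates described in the abstract can be illustrated with a minimal sketch. It is not the paper's exact loss: the function name `llds_penalty`, the hinge form `max(0, old - new)`, and the use of summed per-token log-probabilities as the trajectory likelihood are all assumptions made for illustration. The inputs are per-token log-probabilities of one response under the old policy and the current policy.

```python
from typing import Sequence


def llds_penalty(old_logps: Sequence[float], new_logps: Sequence[float]) -> float:
    """Illustrative likelihood-preserving penalty for one trajectory.

    Assumed form (not taken from the paper): a hinge on per-token
    log-probability drops, applied only if the whole trajectory lost likelihood.
    """
    if len(old_logps) != len(new_logps):
        raise ValueError("log-probability sequences must have equal length")

    # Trajectory-level gate: skip responses whose total log-likelihood
    # did not decrease relative to the old policy.
    if sum(new_logps) >= sum(old_logps):
        return 0.0

    # Token-level gate: only tokens whose log-probability dropped contribute.
    return sum(max(0.0, old - new) for old, new in zip(old_logps, new_logps))


# Trajectory likelihood rose overall, so no penalty is applied.
no_penalty = llds_penalty([-1.0, -2.0], [-0.5, -2.0])

# Trajectory likelihood fell; only the first token (which dropped) is penalized.
gated_penalty = llds_penalty([-1.0, -2.0], [-1.5, -1.75])
```

In a training loop such a term would be scaled by a coefficient and added to the GRPO objective; the coefficient and any normalization by response length are left out here because the summary does not specify them.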

Paper Structure

This paper contains 29 sections, 3 theorems, 25 equations, 15 figures, 2 tables.

Key Result

Theorem 4.2

In tool-integrated GRPO, the likelihood of an otherwise correct response can decrease when (i) incorrect responses of low likelihood and (ii) incorrect responses whose embeddings closely resemble the correct one induce large negative gradients that dominate the positive updates. These forces jointly

Figures (15)

  • Figure 1: Comparative performance of LLDS and baseline methods on benchmark datasets. All baselines are built upon Qwen2.5-7B-Instruct. See the results section for details.
  • Figure 2: We illustrate the likelihood displacement in tool-integrated RL training. The steady-decay phase (steps 60-120) emerges when the reward begins to increase only gradually. In the subsequent acceleration phase (after step 120), the likelihood of correct responses drops sharply, accompanied by a sudden surge in gradient magnitude (red star), leading to gradient explosion. A zoomed-in view of the acceleration region highlights this effect, showing a clearer likelihood displacement in which the gradient grows rapidly while the reward starts to decline.
  • Figure 3: Effect of likelihood displacement across different training iterations for the Qwen2.5-3B-Instruct model. Results are computed on the first 50 samples of the training set, discarding cases where all responses are uniformly correct or uniformly incorrect. Bars below zero (orange) indicate samples whose correct responses' likelihood decreases after training.
  • Figure 4: We illustrate how entropy, response length, and valid-search ratio evolve during training. For both Qwen2.5-3B-Instruct (a) and Qwen2.5-3B-Base (b), entropy exhibits an accelerated upward trend prior to collapse, indicating a strong LD issue. Meanwhile, the response length and valid-search times remain stable in the early stages but later begin to fluctuate markedly and eventually drop sharply.
  • Figure 5: Evolution of token log-likelihood (measured before vs. after feedback; left axis) and the observation-match ratio for wrong answers (right axis). With training, both likelihoods drop while the overlap between tool observations in incorrect and correct trajectories increases, suggesting many incorrect responses begin with a correct search, which skews likelihood estimates and contributes to LLD.
  • ...and 10 more figures
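
The per-sample measurement described in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: the `Response` tuple layout, the function name `correct_likelihood_change`, and the choice to average over correct responses are assumptions, not details confirmed by the paper. Groups whose responses are all correct or all incorrect are skipped, as the caption states.

```python
from typing import List, Optional, Tuple

# Hypothetical record for one sampled response:
# (is_correct, log-likelihood before the update, log-likelihood after the update)
Response = Tuple[bool, float, float]


def correct_likelihood_change(group: List[Response]) -> Optional[float]:
    """Mean change in log-likelihood over the correct responses of one sample.

    Returns None for groups that are uniformly correct or uniformly incorrect,
    mirroring the filtering described in the Figure 3 caption.
    """
    flags = [is_correct for is_correct, _, _ in group]
    if all(flags) or not any(flags):
        return None

    deltas = [after - before for is_correct, before, after in group if is_correct]
    return sum(deltas) / len(deltas)


mixed_group = [(True, -4.0, -5.0), (True, -2.0, -2.5), (False, -3.0, -3.0)]
change = correct_likelihood_change(mixed_group)  # negative: correct responses lost likelihood
```

A negative value corresponds to a bar below zero in the figure, meaning the correct responses for that sample became less likely after training.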

Theorems & Definitions (5)

  • Definition 4.1: Tool-Lazy Likelihood Displacement
  • Theorem 4.2: Informal: Trajectory-Level LLD in Tool-Integrated GRPO
  • Definition 5.1: LLD Death Spiral
  • Theorem A.1: Trajectory-Level LLD in Tool-Integrated GRPO
  • Theorem A.2: Action-level