On the Plasticity and Stability for Post-Training Large Language Models

Wenwen Qiang, Ziyin Gu, Jiahuan Zhou, Jie Hu, Jingyao Wang, Changwen Zheng, Hui Xiong

TL;DR

Probabilistic Conflict Resolution (PCR) is a Bayesian framework that models gradients as random variables and dynamically arbitrates conflicts via an uncertainty-aware ``soft projection'' mechanism, optimizing the signal-to-noise ratio.

Abstract

Training stability remains a critical bottleneck for Group Relative Policy Optimization (GRPO), often manifesting as a trade-off between reasoning plasticity and general capability retention. We identify a root cause as the geometric conflict between plasticity and stability gradients, which leads to destructive interference. Crucially, we argue that deterministic projection methods are suboptimal for GRPO as they overlook the intrinsic stochasticity of group-based gradient estimates. To address this, we propose Probabilistic Conflict Resolution (PCR), a Bayesian framework that models gradients as random variables. PCR dynamically arbitrates conflicts via an uncertainty-aware ``soft projection'' mechanism, optimizing the signal-to-noise ratio. Extensive experiments demonstrate that PCR significantly smooths the training trajectory and achieves superior performance in various reasoning tasks.
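
The geometric conflict described above can be illustrated with a small sketch. It assumes a PCGrad-style hard projection as a stand-in for the "deterministic projection methods" the abstract refers to, and it adds a hypothetical `retain` scalar that keeps a fraction of the conflicting component. The function names, the toy gradients, and the `retain` knob are illustrative assumptions; this is not the paper's PCR update, whose uncertainty-aware formula is not reproduced on this page.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; a negative value signals conflicting gradients."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def project_conflict(g_pla, g_sta, retain=0.0):
    """Remove the component of g_pla that opposes g_sta.

    retain=0.0 gives the hard (deterministic) projection;
    retain=1.0 leaves g_pla unchanged.
    `retain` is a hypothetical knob for illustration only.
    """
    inner = dot(g_pla, g_sta)
    if inner >= 0.0:
        # Gradients do not conflict, so nothing is removed.
        return list(g_pla)
    scale = (1.0 - retain) * inner / dot(g_sta, g_sta)
    return [p - scale * s for p, s in zip(g_pla, g_sta)]


# Toy gradients (made up): the plasticity gradient opposes the stability one.
g_pla = [1.0, -2.0]
g_sta = [0.0, 1.0]

hard = project_conflict(g_pla, g_sta)              # conflict fully removed
soft = project_conflict(g_pla, g_sta, retain=0.5)  # half of the conflict kept

print(cosine(g_pla, g_sta) < 0)  # True: the two gradients conflict
print(hard)                      # [1.0, 0.0]
print(soft)                      # [1.0, -1.0]
```

With `retain=0.0` the result is orthogonal to `g_sta`, which is the all-or-nothing behavior the abstract argues is suboptimal when gradient estimates are noisy. A fixed `retain` only hints at the idea of a softer rule; PCR, per the abstract, sets the strength from gradient uncertainty instead.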

Paper Structure (20 sections, 2 theorems, 16 equations, 5 figures, 1 table)


Key Result

Proposition 4.1

The optimal update magnitude $x^*$ is governed by an expression in which $\lambda = 1/\sigma^2$ denotes precision and the scalar $k \in [0, 1]$ is defined as the retention coefficient.

Figures (5)

  • Figure 1: Motivating results. (a) Results of AIME accuracy, MMLU score, and PPL, varying with the KL coefficient $\beta$. (b) The Pareto frontier. (c) The layer-wise cosine similarity between plasticity and stability gradients across training steps.
  • Figure 2: Performance analysis with PCR on code reasoning tasks. We record the 1-shot and 5-shot results on HumanEval.
  • Figure 3: Stability analyses. We provide the norm of the gradient during training. A stable gradient norm implies consistent updates; large swings suggest unstable or overly aggressive shifts.
  • Figure 4: Ablation Study. (a) and (b) evaluate the effect of different components within PCR. (c) shows the scalability to larger update magnitudes. More experiments and results are provided in Appendix L.
  • Figure 5: Visualization results. (a) The distribution of projection strength. (b) The cosine similarity between $\mathbf{g}_{final}$ and $\mathbf{g}_{sta}$. More results are shown in Appendix L.2.

Theorems & Definitions (2)

  • Proposition 4.1: Optimal Conflict Retention
  • Theorem 5.1: MMSE Optimality of Soft Projection