On the Plasticity and Stability for Post-Training Large Language Models

Wenwen Qiang; Ziyin Gu; Jiahuan Zhou; Jie Hu; Jingyao Wang; Changwen Zheng; Hui Xiong

TL;DR

Probabilistic Conflict Resolution (PCR) is a Bayesian framework that models gradients as random variables and dynamically arbitrates conflicts between them via an uncertainty-aware "soft projection" mechanism, optimizing the signal-to-noise ratio.

Abstract

Training stability remains a critical bottleneck for Group Relative Policy Optimization (GRPO), often manifesting as a trade-off between reasoning plasticity and general capability retention. We identify a root cause: the geometric conflict between plasticity and stability gradients, which leads to destructive interference. Crucially, we argue that deterministic projection methods are suboptimal for GRPO because they overlook the intrinsic stochasticity of group-based gradient estimates. To address this, we propose Probabilistic Conflict Resolution (PCR), a Bayesian framework that models gradients as random variables. PCR dynamically arbitrates conflicts via an uncertainty-aware "soft projection" mechanism, optimizing the signal-to-noise ratio. Extensive experiments demonstrate that PCR significantly smooths the training trajectory and achieves superior performance on various reasoning tasks.
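The contrast between a deterministic projection and a "soft projection" can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `soft_project`, the variable name `g_pla`, and the choice to pass the retention coefficient `k` in as a plain argument are all assumptions made for illustration. The sketch only shows the geometric idea that, when the plasticity gradient conflicts with the stability gradient (negative inner product), a fraction `k` of the conflicting component is kept rather than being removed outright.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def soft_project(g_pla, g_sta, k):
    """Keep a fraction k of the part of g_pla that conflicts with g_sta.

    Hypothetical helper for illustration only.
    k = 0 removes the conflicting component entirely (a hard projection);
    k = 1 leaves g_pla unchanged.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    sta_sq = dot(g_sta, g_sta)
    inner = dot(g_pla, g_sta)
    # No conflict (or a zero stability gradient): return a copy unchanged.
    if sta_sq == 0.0 or inner >= 0.0:
        return list(g_pla)
    # Coefficient of the component of g_pla along g_sta (negative here).
    coeff = inner / sta_sq
    # Subtract a (1 - k) share of that conflicting component.
    return [p - (1.0 - k) * coeff * s for p, s in zip(g_pla, g_sta)]


g_pla = [1.0, -1.0]   # assumed plasticity gradient
g_sta = [0.0, 1.0]    # assumed stability gradient
hard = soft_project(g_pla, g_sta, 0.0)
half = soft_project(g_pla, g_sta, 0.5)
kept = soft_project(g_pla, g_sta, 1.0)
```

In this toy example, `k = 0` yields a vector orthogonal to `g_sta`, while intermediate values interpolate between the hard projection and the original gradient. How PCR chooses the coefficient from gradient uncertainty is described in the paper itself and is not reproduced by this sketch.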

Paper Structure

This paper contains 20 sections, 2 theorems, 16 equations, 5 figures, 1 table.

Introduction
Related Work
Problem Formulation and Analysis
Reformulating the GRPO Objective
Dual Decomposition of Loss and Gradients
Empirical Analysis
Methodology
Probabilistic Modeling
Geometric Decomposition
Bayesian Arbitration
Reconstruction and Projection
Policy Optimization
Theoretical Analysis
Experiment
Experimental Settings
...and 5 more sections

Key Result

Proposition 4.1

The optimal update magnitude $x^*$ is governed by an expression in which $\lambda = 1/\sigma^2$ denotes precision and the scalar $k \in [0, 1]$ is defined as the retention coefficient.
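For intuition about how a precision can give rise to a coefficient in $[0, 1]$, the following textbook Gaussian shrinkage example may help. It is an illustrative assumption, not the expression from Proposition 4.1: the observation model, the zero-mean prior, and the symbols $y$, $\varepsilon$, $\sigma_0^2$, and $\lambda_0$ are introduced here only for the example.

```latex
\begin{aligned}
y &= x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2), \qquad x \sim \mathcal{N}(0, \sigma_0^2), \\
\lambda &= 1/\sigma^2, \qquad \lambda_0 = 1/\sigma_0^2, \\
\mathbb{E}[x \mid y] &= \frac{\lambda}{\lambda + \lambda_0}\, y = k\, y, \qquad k = \frac{\lambda}{\lambda + \lambda_0}.
\end{aligned}
```

Under these assumptions, $k$ lies strictly between 0 and 1 for finite positive precisions, approaching 1 as the observation becomes more precise ($\lambda \to \infty$) and 0 as it becomes noisier ($\lambda \to 0$). For instance, $\sigma^2 = 0.25$ and $\sigma_0^2 = 1$ give $\lambda = 4$, $\lambda_0 = 1$, and $k = 0.8$.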

Figures (5)

Figure 1: Motivating results. (a) Results of AIME accuracy, MMLU score, and PPL, varying with the KL coefficient $\beta$. (b) The Pareto frontier. (c) The layer-wise cosine similarity between plasticity and stability gradients across training steps.
Figure 2: Performance analysis with PCR on code reasoning tasks. We record the 1-shot and 5-shot results on HumanEval.
Figure 3: Stability analyses. We provide the norm of the gradient during training. A stable gradient norm implies consistent updates; large swings suggest unstable or overly aggressive shifts.
Figure 4: Ablation Study. (a) and (b) evaluate the effect of different components within PCR. (c) shows the scalability to larger update magnitudes. More experiments and results are provided in Appendix L.
Figure 5: Visualization results. (a) The distribution of projection strength. (b) The cosine similarity between $\mathbf{g}_{final}$ and $\mathbf{g}_{sta}$. More results are shown in Appendix L.2.

Theorems & Definitions (2)

Proposition 4.1: Optimal Conflict Retention
Theorem 5.1: MMSE Optimality of Soft Projection
