On Predictability of Reinforcement Learning Dynamics for Large Language Models

Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang

TL;DR

The paper addresses the opacity of parameter dynamics in RL-fine-tuned LLMs by uncovering two laws: Rank-1 Dominance, where the top singular subspace of the parameter update largely determines reasoning gains, and Rank-1 Linear Dynamics, where this subspace evolves nearly linearly during training. Leveraging these insights, it introduces AlphaRL, a plug-in acceleration that predicts final updates from an early training window using rank-1 trajectories and PLS regression, achieving up to 2.5× speedups while preserving over 96% of final reasoning performance without extra modules or tuning. The approach is validated across 8 models and 7 RL algorithms on multiple reasoning benchmarks, demonstrating strong generality and the potential for interpretable, efficient RL for large-scale reasoning models. Together, these results point to a principled, low-dimensional core governing RL-induced reasoning in LLMs, with practical implications for scalable training and deployment.
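As a rough illustration of Rank-1 Dominance, the sketch below extracts the top singular component of a parameter update $\Delta W = W_{\text{RL}} - W_{\text{base}}$ and applies only that component to the base weights. The function name `rank1_update`, the matrix sizes, and the synthetic weights are assumptions made for illustration; they are not taken from the paper, and the paper's exact per-layer procedure may differ.

```python
import numpy as np

def rank1_update(w_base, w_rl):
    """Return the rank-1 part (sigma_1 * u_1 * v_1^T) of the update w_rl - w_base."""
    delta = w_rl - w_base
    # Thin SVD; singular values come back sorted in descending order.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0])

# Synthetic stand-in for one weight matrix (hypothetical sizes and values).
rng = np.random.default_rng(0)
w_base = rng.normal(size=(6, 4))
# Synthetic update: a strong rank-1 component plus small dense noise.
delta_true = 3.0 * np.outer(rng.normal(size=6), rng.normal(size=4))
delta_true += 0.01 * rng.normal(size=(6, 4))
w_rl = w_base + delta_true

delta_r1 = rank1_update(w_base, w_rl)
w_rank1 = w_base + delta_r1  # base weights plus only the rank-1 update

# Relative Frobenius error of keeping only the top singular component.
rel_err = np.linalg.norm(delta_true - delta_r1) / np.linalg.norm(delta_true)
print(f"relative error of rank-1 approximation: {rel_err:.4f}")
```

On this toy update the rank-1 component reproduces almost all of $\Delta W$ by construction. Whether the same holds for a real RL checkpoint is the empirical claim the paper tests on reasoning benchmarks.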

Abstract

Recent advances in reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 8 LLMs and 7 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window, achieving up to 2.5× speedup while retaining more than 96% of reasoning performance without extra modules or hyperparameter tuning. This positions our finding as a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
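The idea of extrapolating the final update from an early window can be sketched as follows. This is a simplified stand-in, not AlphaRL itself: the TL;DR states that AlphaRL uses PLS regression, whereas this sketch swaps in an ordinary per-entry least-squares line fit over training steps. The function names, step counts, and synthetic checkpoints are all hypothetical.

```python
import numpy as np

def top_rank1(delta):
    """Rank-1 truncation of a matrix via SVD."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0])

def extrapolate_rank1(steps, deltas, target_step):
    """Fit a line (per matrix entry) to early rank-1 updates and evaluate it at target_step.

    Assumption: plain least squares replaces the PLS regression named in the paper.
    Fitting the product sigma * u * v^T avoids the sign ambiguity of u and v.
    """
    shape = deltas[0].shape
    # One row per checkpoint, one column per matrix entry.
    y = np.stack([top_rank1(d).ravel() for d in deltas])
    slope, intercept = np.polyfit(np.asarray(steps, dtype=float), y, deg=1)
    pred = (slope * target_step + intercept).reshape(shape)
    # Project the prediction back onto rank 1.
    return top_rank1(pred)

# Synthetic trajectory: the update grows linearly toward a rank-1 target.
rng = np.random.default_rng(0)
u = rng.normal(size=6)
u /= np.linalg.norm(u)
v = rng.normal(size=4)
v /= np.linalg.norm(v)
final_delta = 5.0 * np.outer(u, v)  # hypothetical update at step 100

early_steps = [10, 20, 30, 40]
early_deltas = [
    (t / 100.0) * final_delta + 1e-3 * rng.normal(size=(6, 4))
    for t in early_steps
]

predicted = extrapolate_rank1(early_steps, early_deltas, target_step=100)
rel_err = np.linalg.norm(predicted - final_delta) / np.linalg.norm(final_delta)
print(f"relative error of extrapolated update: {rel_err:.4f}")
```

The toy trajectory is linear by construction, so the extrapolation succeeds here trivially. The sketch only shows the mechanics of predicting a later update from early checkpoints and says nothing about how well real RL trajectories follow a line.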

Paper Structure

This paper contains 18 sections, 45 equations, 28 figures, 2 tables.

Figures (28)

  • Figure 1: Comparison between RL-trained models and their Rank-1% parameter update counterparts across five reasoning benchmarks. The results demonstrate that retaining only the Top 1% of the parameter update matrix is sufficient to recover the reasoning gains achieved by RL-trained models. More detailed experimental settings and results are given in Section 2. Best viewed in color.
  • Figure 2: Overview of our key findings and method. (a) Rank-1 Dominance: The majority of reasoning improvements induced by RL can be captured by the Rank-1 Subspace of the parameter update $\Delta W$, and this holds throughout the RL training process. (b) AlphaRL: Leveraging Rank-1 Linear Dynamics, AlphaRL predicts the trajectory of the Rank-1 Subspace, allowing models to reach final performance with fewer RL training steps. Best viewed in color.
  • Figure 3: (a) Performance under Rank-1 and Rank-$k\%$ Subspaces on MATH-500; (b) Performance of the Rank-1 Subspace across training. Best viewed in color.
  • Figure 4: (a) Effect of different single subspaces on performance; (b) Effect of scaling the Rank-1 Subspace updates on performance. Best viewed in color.
  • Figure 5: (a) L2 norm of updates across methods and the fraction of update information captured by the unscaled Rank-1 and Rank-1% Subspaces; (b) Effect of different update methods on the embedding layer, with the two embedding representations of the same token connected by gray lines. Best viewed in color.
  • ...and 23 more figures