On Predictability of Reinforcement Learning Dynamics for Large Language Models
Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang
TL;DR
The paper addresses the opacity of parameter dynamics in RL-fine-tuned LLMs by uncovering two laws: Rank-1 Dominance, where the top singular subspace of the parameter update largely determines reasoning gains, and Rank-1 Linear Dynamics, where this subspace evolves nearly linearly during training. Building on these laws, it introduces AlphaRL, a plug-in acceleration framework that predicts the final update from an early training window using rank-1 trajectories and partial least squares (PLS) regression, achieving up to a 2.5× speedup while preserving over 96% of final reasoning performance without extra modules or hyperparameter tuning. The approach is validated across 8 models and 7 RL algorithms on multiple reasoning benchmarks, which indicates that the two laws generalize and that RL for large-scale reasoning models can be made more interpretable and efficient. Together, these results point to a principled, low-dimensional core governing RL-induced reasoning in LLMs, with practical implications for scalable training and deployment.
Abstract
Recent advances in the reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 8 LLMs and 7 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update from a short early training window, achieving up to a 2.5× speedup while retaining >96% of reasoning performance without extra modules or hyperparameter tuning. This positions our findings as a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
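The two properties can be illustrated with a minimal NumPy sketch. It is not the paper's AlphaRL implementation. All function names are hypothetical, the update is treated as a single weight matrix (checkpoint weights minus base weights), and an ordinary least-squares line fit per matrix entry stands in for the regression used in the paper. The energy ratio is only a spectral proxy, not the reasoning-performance recovery reported above.

```python
import numpy as np


def rank1_update(delta_w: np.ndarray) -> np.ndarray:
    """Best rank-1 approximation sigma_1 * u_1 v_1^T of an update matrix."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    # The product is invariant to the joint sign flip of (u_1, v_1).
    return s[0] * np.outer(u[:, 0], vt[0, :])


def rank1_energy(delta_w: np.ndarray) -> float:
    """Share of squared singular values carried by the top component.

    Assumption: a spectral proxy only; the paper measures recovered
    reasoning performance, which this number does not capture.
    """
    s = np.linalg.svd(delta_w, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


def extrapolate_rank1(deltas, steps, target_step):
    """Fit a line through early rank-1 updates and evaluate it at target_step.

    deltas: list of update matrices (checkpoint minus base), same shape.
    steps:  training step of each early checkpoint.
    Assumption: plain least squares per entry, swapped in for illustration.
    """
    shape = deltas[0].shape
    y = np.stack([rank1_update(d).ravel() for d in deltas])  # (T, m*n)
    x = np.asarray(steps, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # each of shape (m*n,)
    return (slope * target_step + intercept).reshape(shape)


if __name__ == "__main__":
    # Synthetic data: an update growing linearly along one fixed direction.
    rng = np.random.default_rng(0)
    direction = np.outer(rng.normal(size=6), rng.normal(size=4))
    early_steps = [1, 2, 3, 4]
    early_deltas = [0.1 * t * direction for t in early_steps]
    predicted = extrapolate_rank1(early_deltas, early_steps, target_step=10)
    print("max abs error:", np.abs(predicted - direction).max())
```

In this toy setting the update lies exactly in one rank-1 direction and grows linearly, so the extrapolation is exact up to floating-point error. On real checkpoints the quality of the prediction depends on how closely the two properties hold for the model and algorithm at hand.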
