Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs
Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tür, Hao Peng
TL;DR
This work challenges the default use of AdamW for RLVR in LLMs by showing that SGD can match or surpass AdamW across multiple models, tasks, and RL algorithms. It finds that RLVR induces highly sparse and low-rank parameter updates, which suggests that optimization in RLVR operates within a low-dimensional subspace. Removing momentum and per-parameter adaptive learning rates does not sacrifice performance and substantially reduces optimizer memory, which matters for scaling RL in language models. The results imply that RL optimization dynamics differ markedly from those of SFT, motivating RLVR-specific algorithmic choices and more memory-efficient RL fine-tuning of LLMs.
Abstract
Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token prediction stages (e.g., pretraining and supervised fine-tuning (SFT)), despite fundamental differences between RL and these stages highlighted by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rates in AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam-style per-parameter adaptive learning rates and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. These findings provide new insights into the optimization dynamics of RL in LLMs and show that RL can be substantially more parameter-efficient than previously recognized.
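
To make the two quantities in the abstract concrete, here is a minimal sketch (not the authors' code). It counts the extra per-parameter state each optimizer keeps and measures the fraction of parameters that differ between two checkpoints. The function names, the dictionary-of-lists checkpoint format, and the `tol` threshold are assumptions made for illustration; the abstract does not say how the paper decides that a parameter counts as "updated".

```python
from typing import Dict, List

# Extra values stored per model parameter by each optimizer.
# AdamW keeps first- and second-moment estimates (2 per parameter);
# SGD with momentum keeps one buffer; plain SGD keeps none.
STATE_PER_PARAM = {"adamw": 2, "sgd_momentum": 1, "sgd": 0}


def optimizer_state_values(num_params: int, optimizer: str) -> int:
    """Number of optimizer-state values held alongside the weights."""
    return STATE_PER_PARAM[optimizer] * num_params


def updated_fraction(
    before: Dict[str, List[float]],
    after: Dict[str, List[float]],
    tol: float = 0.0,  # hypothetical threshold, an assumption of this sketch
) -> float:
    """Fraction of parameters whose value changed by more than `tol`."""
    changed = 0
    total = 0
    for name, old_values in before.items():
        new_values = after[name]
        for old, new in zip(old_values, new_values):
            total += 1
            if abs(old - new) > tol:
                changed += 1
    return changed / total


if __name__ == "__main__":
    # Toy checkpoints: only one of five parameters moves during training.
    before = {"w": [0.1, 0.2, 0.3, 0.4], "b": [0.0]}
    after = {"w": [0.1, 0.25, 0.3, 0.4], "b": [0.0]}
    print(updated_fraction(before, after))
    print(optimizer_state_values(7_000_000_000, "adamw"))
    print(optimizer_state_values(7_000_000_000, "sgd"))
```

For a hypothetical 7-billion-parameter model, AdamW holds 14 billion state values in addition to the weights, while plain SGD holds none. Under this definition, the abstract's "fewer than 0.02%" would correspond to `updated_fraction` returning a value below 0.0002.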
