Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tür, Hao Peng

TL;DR

This work challenges the prevailing use of AdamW for RLVR in LLMs by showing that SGD can match or surpass AdamW across multiple models, tasks, and RL algorithms. It reveals that RLVR induces highly sparse and low-rank parameter updates, leading to substantial memory savings and suggesting optimization in RLVR operates within a low-dimensional subspace. The authors demonstrate that removing momentum and per-parameter adaptive learning rates does not sacrifice performance, and in fact often improves efficiency, raising practical implications for scalable RL in language models. The results imply that RL optimization dynamics differ markedly from SFT, motivating RLVR-specific algorithmic choices and opening avenues for more memory-efficient and scalable RL fine-tuning of LLMs.

Abstract

Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token prediction stages (e.g., pretraining and supervised fine-tuning), despite fundamental differences between RL and these stages highlighted by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rates in AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam-style per-parameter adaptive learning rates and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. These findings provide new insights into the optimization dynamics of RL in LLMs and show that RL can be substantially more parameter-efficient than previously recognized.
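
To make the memory argument in the abstract concrete, the following is a minimal sketch of the two update rules in plain Python. It is not the authors' implementation: the function names, the `state` dictionary layout, and the default hyperparameters (`beta1`, `beta2`, `eps`, `weight_decay`) are assumptions chosen for illustration. It shows that vanilla SGD stores nothing beyond the weights, while AdamW keeps two extra values (first and second moment) per parameter.

```python
# Illustrative sketch only: plain-Python versions of the two update rules.
# Names and default hyperparameters are assumptions, not values from the paper.
import math


def sgd_step(params, grads, lr):
    """Vanilla SGD: no momentum, no per-parameter scaling, no optimizer state."""
    return [p - lr * g for p, g in zip(params, grads)]


def adamw_step(params, grads, state, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """AdamW: keeps a first moment m and a second moment v for every parameter."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialised moments.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Decoupled weight decay, then the adaptively scaled step.
        p = p - lr * weight_decay * p
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


def optimizer_state_floats(num_params, optimizer):
    """Extra values the optimizer stores beyond the weights themselves."""
    return {"sgd": 0, "adamw": 2 * num_params}[optimizer]


# Example: one step of each rule on the same two-parameter toy problem.
params, grads = [1.0, 2.0], [0.5, -1.0]
sgd_out = sgd_step(params, grads, lr=0.1)
adam_state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
adam_out = adamw_step(params, grads, adam_state, lr=0.1)
```

In this sketch, the first AdamW step moves every parameter by roughly `lr` regardless of gradient magnitude, because the bias-corrected ratio `m_hat / sqrt(v_hat)` equals the sign of the gradient, whereas the SGD step scales directly with the gradient. This is a property of the update rules themselves and is not offered here as the paper's explanation of update sparsity.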

Paper Structure (37 sections, 2 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Training reward (left) and validation reward on MATH (right) comparing SGD and AdamW.
  • Figure 2: Comparison of $\sqrt{v}$ distributions between SFT and RLVR at step 50. RLVR concentrates in a narrower, low-magnitude regime. The standard deviation is ${\sim}22\times$ higher in SFT ($\sigma = 5.11 \times 10^{-6}$) than in RLVR ($\sigma = 2.29 \times 10^{-7}$).
  • Figure 3: SGD updates are distributed across the model rather than concentrated in specific layers. Across all layers, SGD produces significantly sparser updates than AdamW.
  • Figure 4: Update sparsity of SGD barely decreases as training proceeds. The plots are from the math experiments.
  • Figure 5: SGD Learning Rate Ablation on Qwen3-8B (NuminaMath-CoT). Left: Training reward curves over optimization steps. SGD converges to comparable or higher rewards than AdamW, provided the learning rate is sufficiently high. Right: Final training reward as a function of learning rate (log scale). SGD requires learning rates orders of magnitude larger than AdamW ($10^{-6}$) to achieve peak performance in the RLVR setting.
  • ...and 1 more figure
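
The update-sparsity figures above (Figures 3 and 4) rest on counting how many parameters actually change during training. A minimal sketch of such a measurement follows, assuming that a parameter counts as "updated" when its value differs between two checkpoints by more than a tolerance. The function name `update_fraction` and the `tol` argument are hypothetical; the paper's exact criterion is not stated in this summary.

```python
# Illustrative sketch only: fraction of parameters that differ between two
# checkpoints. The name `update_fraction` and the `tol` threshold are
# assumptions; the paper's exact definition is not given in this summary.

def update_fraction(before, after, tol=0.0):
    """Return the share of entries whose absolute change exceeds `tol`."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    if not before:
        return 0.0
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Example: 2 of 10,000 toy parameters change between checkpoints.
before = [0.0] * 10_000
after = list(before)
after[3] = 1e-3
after[7_000] = -2e-3
fraction = update_fraction(before, after)
```

In this toy example 2 of 10,000 entries change, a fraction of 0.02%, which is the threshold the abstract reports SGD staying below. For a real model the same count would be taken over every weight tensor, flattened, and the choice of `tol` would depend on the numeric precision of the stored weights.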