AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

Huishuai Zhang, Bohan Wang, Luoxin Chen

TL;DR

AdamS tackles the high memory cost of Adam-based optimizers in large-language-model training by replacing second-moment estimates with a momentum-gradient-based denominator, achieving memory footprints comparable to SGD with momentum while matching AdamW performance. It grounds the design in the observed $(L_0,L_1)$-smoothness of transformer objectives and uses momentum as a robust proxy for gradient magnitude to inform adaptive steps, enabling a drop-in replacement that inherits AdamW hyperparameters. Theoretically, AdamS provably converges under sub-gaussian gradient noise with a rate of $\tilde{O}(T^{-1/4})$, matching known lower bounds for gradient-based methods. Empirically, it demonstrates strong performance on GPT-2 and Llama2 pretraining and RL post-training, with memory savings and, in some settings, increased throughput, making it a practical default for scalable LLM optimization.

Abstract

We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of a weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit the hyperparameters of AdamW, and it is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed $(L_0, L_1)$-smoothness of transformer objectives, where local smoothness is governed by gradient magnitudes that can in turn be approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pretraining runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
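The denominator described in the abstract can be illustrated with a minimal sketch. The abstract only says "root of weighted sum of squares of the momentum and the current gradient"; the specific weights (`beta2` and `1 - beta2`), the use of the previous momentum in the denominator, the `eps` term, and the AdamW-style decoupled weight decay are assumptions made here for illustration. The function name `adams_step` is hypothetical and is not an API from the paper.

```python
import math

def adams_step(params, grads, momentum, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.0):
    """One AdamS-style step on flat lists of floats (illustrative sketch).

    The only optimizer state is `momentum`: one buffer per parameter,
    the same state size as SGD with momentum.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Denominator: root of a weighted sum of squares of the momentum
        # and the current gradient. ASSUMPTION: weights beta2 / (1 - beta2),
        # applied to the momentum from the previous step.
        denom = math.sqrt(beta2 * m * m + (1.0 - beta2) * g * g) + eps
        # Standard exponential-moving-average momentum update.
        m_new = beta1 * m + (1.0 - beta1) * g
        # ASSUMPTION: decoupled weight decay, as in AdamW.
        p_new = p - lr * (m_new / denom + weight_decay * p)
        new_params.append(p_new)
        new_momentum.append(m_new)
    return new_params, new_momentum

# Usage: a single step on one parameter, starting from zero momentum.
params, momentum = adams_step([1.0], [2.0], [0.0], lr=0.1)
```

Under these assumptions no second-moment buffer is stored: the denominator is recomputed at every step from the momentum buffer and the fresh gradient, which is where the memory saving relative to AdamW would come from.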

Paper Structure

This paper contains 21 sections, 7 theorems, 51 equations, 6 figures, 4 tables, 3 algorithms.

Key Result

Theorem 3.2

Let the assumptions on the objective and on the gradient noise hold. Then, setting $\eta_t = \tilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$, $\beta_1 = 1- \tilde{\Theta}(\frac{1}{\sqrt{T}})$, and $\beta_2 = 1 - \tilde{\Theta}(\frac{1}{T})$ with $\frac{1-\beta_1}{\eta} \ge C$, where $C$ is some constant defined in Eq.
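To make the scalings in the theorem statement concrete, here is a small sketch that evaluates them for a given horizon $T$. Logarithmic factors hidden by the tilde notation are ignored, and the constants `c_eta`, `c_beta1`, and `c_beta2` are hypothetical placeholders, not values from the paper.

```python
import math

def theorem_style_schedule(T, c_eta=1.0, c_beta1=1.0, c_beta2=1.0):
    """Hyperparameter scalings in the horizon T, ignoring log factors.

    The c_* arguments are hypothetical constants chosen for illustration.
    """
    eta = c_eta / math.sqrt(T)            # eta   ~ 1 / sqrt(T)
    beta1 = 1.0 - c_beta1 / math.sqrt(T)  # beta1 = 1 - Theta(1 / sqrt(T))
    beta2 = 1.0 - c_beta2 / T             # beta2 = 1 - Theta(1 / T)
    return eta, beta1, beta2

# Usage: scalings for a run of 10,000 steps.
eta, beta1, beta2 = theorem_style_schedule(10_000)
ratio = (1.0 - beta1) / eta  # equals c_beta1 / c_eta, independent of T
```

Because $\eta$ and $1-\beta_1$ shrink at the same $1/\sqrt{T}$ rate, the ratio $\frac{1-\beta_1}{\eta}$ does not depend on $T$ in this sketch, so the condition $\frac{1-\beta_1}{\eta} \ge C$ reads as a constraint on the constants rather than on the horizon.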

Figures (6)

  • Figure 1: Training and validation loss curves for pretraining LLaMA 2–7B models. The proposed AdamS achieves convergence comparable to or better than baseline methods under the same hyperparameter settings as LLaMA 2 (Touvron et al., 2023), while eliminating the need to store AdamW’s second-moment estimates.
  • Figure 2: The cosine similarities between the update matrices of AdamS and AdamW (left), Adam-mini and AdamW (right) for all layers of GPT2-Small model. Across the training trajectory, the update direction of AdamS closely aligns with that of AdamW.
  • Figure 3: The update magnitude of AdamS as the gradient-to-momentum ratio varies, with $\beta_1=0.9$ and $\beta_2=0.9,0.95,0.99,0.999$.
  • Figure 4: Validation loss curves for pretraining GPT-2 models. Across three different model sizes and with the same hyperparameters as AdamW, the proposed AdamS achieves convergence comparable to baseline methods—without the need to store AdamW’s second-moment estimates.
  • Figure 5: Mean critic scores for reinforcement learning (RL) post-training with the GRPO algorithm on the CountDown task, for the Qwen2.5-3B and DeepSeek-R1-Distill-Llama-8B models. AdamS closely tracks AdamW’s performance trajectory, achieving similar convergence curves. In contrast, Lion with default hyperparameters converges significantly more slowly under the same conditions.
  • ...and 1 more figure
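The quantity plotted in Figure 3 can be explored numerically with a rough sketch. It assumes the per-coordinate update is the new momentum divided by the root of a `beta2`-weighted sum of the squared previous momentum and the squared gradient; that exact form is an assumption here, and `update_magnitude` is a hypothetical helper. The previous momentum is normalized to 1, so `ratio` stands for gradient divided by momentum.

```python
import math

def update_magnitude(ratio, beta1=0.9, beta2=0.95):
    """Size of an AdamS-style update when grad / previous momentum = ratio.

    ASSUMPTION: update = m_new / sqrt(beta2 * m_prev^2 + (1 - beta2) * g^2),
    with m_prev normalized to 1 and g = ratio.
    """
    m_new = beta1 + (1.0 - beta1) * ratio
    denom = math.sqrt(beta2 + (1.0 - beta2) * ratio * ratio)
    return abs(m_new) / denom

# Usage: sweep the ratio for the beta2 values listed in the Figure 3 caption.
for beta2 in (0.9, 0.95, 0.99, 0.999):
    curve = [update_magnitude(r, beta2=beta2) for r in (-4, -1, 0, 1, 4)]
```

In this sketch, a gradient that agrees exactly with the momentum (`ratio = 1`) yields a unit-size update for every `beta2`.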

Theorems & Definitions (15)

  • Theorem 3.2
  • proof
  • Lemma D.1
  • proof
  • Lemma D.2
  • proof
  • Lemma D.3
  • proof
  • Lemma D.4
  • proof
  • ...and 5 more