In Search of Adam's Secret Sauce

Antonio Orvieto, Robert M. Gower

TL;DR

The paper asks why Adam remains the premier optimizer for training large transformer language models. It reports a large-scale empirical study, training over 1500 models across varied data settings, that compares Adam to simplified variants such as signed gradient descent and Signum (signed momentum). It finds that setting beta1 = beta2 yields near-optimal performance across diverse configurations, and it gives a new online variational-inference interpretation in which Adam estimates the mean and variance of gradients, effectively implementing a data-dependent, adaptive trust region. Signum closes much of the SGD-Adam gap but still underperforms Adam, and equal betas emerges as a robust simplification with theoretical and practical benefits. These findings offer a principled perspective on Adam's secret sauce and guide practical hyperparameter choices for large-scale language-model training.

Abstract

Understanding the remarkable efficacy of Adam when training transformer-based language models has become a central research topic within the optimization community. To gain deeper insights, several simplifications of Adam have been proposed, such as the signed gradient and signed momentum methods. In this work, we conduct an extensive empirical study, training over 1500 language models across different data configurations and scales, and compare Adam to several known simplified variants. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam, even after careful tuning of momentum, clipping settings, and learning rates. However, our analysis reveals a compelling option that preserves near-optimal performance while allowing for new insightful reformulations: constraining the Adam momentum parameters to be equal, beta1 = beta2. Beyond robust performance, this choice affords new theoretical insights, highlights the "secret sauce" on top of signed momentum, and grants a precise statistical interpretation: we show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients, one that arises from a mean-field Gaussian variational inference perspective.
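To make the comparison between signed momentum and Adam concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. It assumes zero-initialized moving averages, no bias correction, a small `eps` in the denominator, and beta1 = beta2 = beta; the function names are hypothetical. Under these assumptions the Adam step always has the same sign as the Signum step and a magnitude of at most 1, so it can be read as a Signum step that shrinks when the gradients are noisy.

```python
import math
import random


def ema_update(prev, value, beta):
    """One step of an exponential moving average: beta*prev + (1-beta)*value."""
    return beta * prev + (1.0 - beta) * value


def signum_and_adam_steps(grads, beta, eps=1e-8):
    """Return per-step (signum_step, adam_step) pairs for a 1-D gradient stream.

    Assumptions of this sketch: zero initialization, no bias correction,
    and equal betas (beta1 = beta2 = beta).
    """
    m, v = 0.0, 0.0
    steps = []
    for g in grads:
        m = ema_update(m, g, beta)      # first-moment EMA (momentum)
        v = ema_update(v, g * g, beta)  # second-moment EMA
        signum_step = math.copysign(1.0, m) if m != 0.0 else 0.0
        adam_step = m / (math.sqrt(v) + eps)
        steps.append((signum_step, adam_step))
    return steps


random.seed(0)
# Gradients with mean 1 and large noise: the Adam step stays well below 1.
noisy_grads = [1.0 + random.gauss(0.0, 3.0) for _ in range(200)]
steps = signum_and_adam_steps(noisy_grads, beta=0.95)

# Noise-free gradients: the Adam step approaches the Signum step.
clean_steps = signum_and_adam_steps([2.0] * 500, beta=0.95)
```

With a constant gradient the two steps nearly coincide after enough iterations, while on the noisy stream the Adam step is damped. This is only an illustration of the update rules, not a reproduction of the language-model experiments.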

Paper Structure

This paper contains 56 sections, 5 theorems, 57 equations, 24 figures, 3 tables.

Key Result

Proposition 1

Let $m_k = \texttt{EMA}_{\beta}[g_k]$. Then the update in Eq. \ref{eq:adam-1d} admits the equivalent representation:
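The displayed equation of the proposition is not included in this listing and is left as is. Independently of its exact wording, one identity follows directly from the two EMA recursions when the same $\beta$ is used for both moments: writing $v_k = \texttt{EMA}_{\beta}[g_k^2]$, one has $v_k - m_k^2 = \beta\,\texttt{EMA}_{\beta}[(m_{k-1} - g_k)^2]$, so the Adam denominator splits into a squared-mean term and an online variance estimate. The sketch below checks this numerically, assuming zero initialization and no bias correction; the function name is hypothetical.

```python
import random


def check_equal_beta_identity(grads, beta):
    """Return the largest gap between v_k - m_k^2 and beta * s_k over a stream.

    m_k = EMA_beta[g_k], v_k = EMA_beta[g_k^2],
    s_k = EMA_beta[(m_{k-1} - g_k)^2], all zero-initialized
    (an assumption of this sketch).
    """
    m, v, s = 0.0, 0.0, 0.0
    worst = 0.0
    for g in grads:
        # s must be updated first so that it sees the previous mean m_{k-1}.
        s = beta * s + (1.0 - beta) * (m - g) ** 2
        m = beta * m + (1.0 - beta) * g
        v = beta * v + (1.0 - beta) * g * g
        worst = max(worst, abs((v - m * m) - beta * s))
    return worst


random.seed(1)
grads = [random.gauss(0.5, 2.0) for _ in range(1000)]
max_gap = check_equal_beta_identity(grads, beta=0.95)
print(max_gap)  # expected to be at floating-point rounding level
```

The identity relies on the two moving averages sharing the same $\beta$; with different betas the cross terms no longer cancel.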

Figures (24)

  • Figure 1: Pretraining on SlimPajama with Chinchilla-optimal scaling (hoffmann2022training). Both the momentum and the learning rate of Signum are extensively tuned (§\ref{sec:exp}). While Signum closes $96\%$ of the perplexity gap between Adam and SGD with momentum (Table \ref{tab:algorithm-performance}), it still results in a $25\%$ slowdown: Adam achieves the same performance with 3/4 of the budget.
  • Figure 2: Training a total of 265 language models with 160M parameters on 3.2B SlimPajama-627B tokens, with a sequence length of 2048 and a batch size of 256. Shown is the final test perplexity on 100M held-out tokens. Some underperforming runs are not shown to keep the focus on the most interesting range. For a careful description of our tuning grid, see §\ref{app:exp-details}. Takeaway 1: The validation perplexity of highly tuned Signum (65 hyperparameter configurations) with weight decay $0.1$, shown in the top row, is around 23.23 (see Table \ref{tab:algorithm-performance} for multiple seeds at optimal tuning). We ablate on the momentum parameter, learning rate, and presence of global clipping before averaging. The best performance of Signum is reported as a green horizontal line on the second row (200 Adam runs, with weight decay of $0.1$). Most Adam runs perform better than optimally tuned Signum. Takeaway 2: For each $\beta_1$, the optimal corresponding $\beta_2$ (after learning rate tuning) is similar. The higher $\beta_1$, the higher the $\beta_2$ needed for optimal performance (the optimal $\beta$s are correlated).
  • Figure 3: Summary of the results in §\ref{app:more_batches}. At different batch sizes, for each $\beta_1\in\{0.9, 0.95, 0.975\}$, we show the best-performing $\beta_2$ (highest score, yellow) and the gap between its performance and that of other options in the grid. We notice a high correlation between beta values (e.g., $\beta_2=0.9875$ is a terrible option at $\beta_1=0.9$, but a good one at $\beta_1=0.975$). While results are noisy, notice that $\beta_1=\beta_2$ never degrades performance by more than $0.3$ points. In contrast (Table \ref{tab:algorithm-performance}), the gap with Signum can be as high as $1.37$ points.
  • Figure 4: Adding an $\epsilon$ mollifier to Signum, i.e., using $m_k / (\sqrt{m_k^2} + \epsilon)$, offers little to no improvement. We also test both zero initialization (ZI) and gradient initialization (GI) for $m$, and find similar results with no significant improvement. $\epsilon=10^{-3}$ is significantly worse and hence is not shown. A similar finding is reported in Figure \ref{fig:epsilon-signum-quad}.
  • Figure 5: The final validation performance (100M held-out tokens) for 44 LMs with 410M parameters trained on 8.2B SlimPajama tokens (Chinchilla-optimal). Equal betas yield near-optimal performance. We use gradient clipping and a batch size of 512 (scaled by 2 compared to Figure \ref{fig:big_sweep}, as suggested by zhang2025how). Sequence length is 2048, weight decay is $0.1$. Note that the standard setting $(0.9, 0.95)$ is quite suboptimal here.
  • ...and 19 more figures
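The mollified Signum update from the Figure 4 caption, $m_k / (\sqrt{m_k^2} + \epsilon)$, can be sketched in one dimension as follows. This is an illustrative reading of the caption rather than the authors' code: the function name is hypothetical, and the two initializations are assumed to mean starting $m$ at zero (ZI) or at the first gradient (GI).

```python
import math


def mollified_signum_steps(grads, beta, eps, init="zero"):
    """Steps m_k / (sqrt(m_k^2) + eps) for a 1-D gradient stream.

    init="zero" starts m at 0 (ZI); init="grad" starts m at the first
    gradient (GI). Both conventions are assumptions of this sketch.
    """
    if init not in ("zero", "grad"):
        raise ValueError("init must be 'zero' or 'grad'")
    if not grads:
        return []
    m = 0.0 if init == "zero" else grads[0]
    steps = []
    for g in grads:
        m = beta * m + (1.0 - beta) * g              # momentum EMA
        steps.append(m / (math.sqrt(m * m) + eps))   # smoothed sign of m
    return steps


grads = [0.5, -2.0, 3.0]
zi_steps = mollified_signum_steps(grads, beta=0.9, eps=1e-8, init="zero")
gi_steps = mollified_signum_steps(grads, beta=0.9, eps=1e-8, init="grad")
```

For a tiny $\epsilon$ the step is essentially the sign of the momentum, so the mollifier only matters when $|m_k|$ is comparable to $\epsilon$. The two initializations can disagree during the first few steps, as on this short stream, and then converge as the initial value is forgotten.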

Theorems & Definitions (8)

  • Proposition 1
  • theorem 0
  • Proposition 1
  • proof: Proof of Proposition \ref{prop:adammol}
  • Proposition 2
  • proof: Proof of Proposition \ref{prop:adamol-gen}
  • theorem 0
  • proof