A Minimalist Optimizer Design for LLM Pretraining

Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong

TL;DR

This work tackles the memory overhead of adaptive optimizers like Adam in LLM pretraining by asking how to reach state-of-the-art performance with minimal SGD modifications. Through a bottom-up study, it identifies two high-signal components—column-wise gradient normalization and first-order momentum applied only to the last layer—that together yield the SCALE optimizer. SCALE matches or exceeds Adam-like performance while using only a fraction of the memory, and consistently outperforms other memory-efficient baselines across multiple LLaMA model sizes. The results establish SCALE as a practical, minimalistic baseline for memory-constrained pretraining and offer design principles for future optimizer research in large-scale language modeling.
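
The two components named above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `col_normalize` and `scale_like_step`, the `(d_in, d_out)` matrix layout, the unit-norm scaling, the `eps` guard, and the choice to normalize the momentum buffer before the update are all hypothetical.

```python
import math


def col_normalize(grad, eps=1e-8):
    """Rescale each column of a 2-D gradient (list of rows) to unit L2 norm.

    Assumption: the matrix is stored as (d_in, d_out), so one column
    corresponds to one output unit.
    """
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [
        [grad[i][j] / (norms[j] + eps) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def scale_like_step(params, grads, momentum, last_layer="lm_head",
                    lr=0.1, beta=0.9):
    """One illustrative update over a dict of named weight matrices.

    Only `last_layer` keeps a momentum buffer; every other layer is
    updated from its column-normalized stochastic gradient alone.
    """
    for name, g in grads.items():
        if name == last_layer:
            # First-order momentum, stored for the output layer only.
            m = momentum.get(name)
            if m is None:
                m = [[0.0] * len(g[0]) for _ in g]
            m = [
                [beta * mv + (1.0 - beta) * gv for mv, gv in zip(m_row, g_row)]
                for m_row, g_row in zip(m, g)
            ]
            momentum[name] = m
            g = m
        direction = col_normalize(g)
        params[name] = [
            [w - lr * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(params[name], direction)
        ]
    return params, momentum
```

After one call, `momentum` contains a single entry for the output layer, which is where the memory saving relative to Adam comes from in this sketch.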

Abstract

Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory than SGD to maintain first- and second-order moments. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory-efficient pretraining. Across multiple LLaMA models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For the LLaMA 7B model, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in terms of both perplexity and memory consumption.
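
To see why a last-layer-only momentum buffer lands in roughly this memory range, a back-of-the-envelope count helps. The sketch below is a rough assumption, not the paper's memory estimation: it counts only weights plus optimizer states, ignores gradients and activations, and uses a hypothetical model with 1.0e9 parameters and a 32,000 x 2,048 output layer. The function names are made up for illustration.

```python
def adam_memory(total_params, bytes_per_value=2):
    """Weights plus two Adam states (first and second moments) per parameter."""
    return (total_params + 2 * total_params) * bytes_per_value


def last_layer_momentum_memory(total_params, last_layer_params,
                               bytes_per_value=2):
    """Weights plus one momentum buffer kept for the output layer only."""
    return (total_params + last_layer_params) * bytes_per_value


# Hypothetical sizes, chosen only for illustration.
total = 1_000_000_000
last = 32_000 * 2_048

ratio = last_layer_momentum_memory(total, last) / adam_memory(total)
print(f"{ratio:.1%}")  # prints 35.5%
```

Under this simplified count the ratio shrinks as the model grows, because the output layer becomes a smaller share of all parameters.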

Paper Structure

This paper contains 19 sections, 5 theorems, 56 equations, 7 figures, 11 tables, 1 algorithm.

Key Result

Theorem 2.1

Suppose $\ell(\theta)$ in (eq:finite_sum) is lower bounded by $\ell^*$ and $\gamma$-smooth (i.e., $\nabla \ell(\theta)$ is Lipschitz continuous with constant $\gamma$), and the stochastic gradient is unbiased, $\mathbb{E}_{\xi_t}\nabla_{\theta_l} \ell(\theta^t;\xi_t) = \nabla_{\theta_l} \ell(\theta^t)$ [...] where $\nabla_l^t=\nabla_{\theta_l} \ell(\theta^t;\xi_t)$ and $\Delta_1=\ell(\theta^1) - \ell^*$.
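
For reference, the two assumptions on $\ell$ stated above can be written out explicitly. The Euclidean norm is assumed here; the excerpt does not say which norm the paper uses.

$$
\ell(\theta) \ge \ell^* \quad \forall\, \theta,
\qquad
\|\nabla \ell(\theta) - \nabla \ell(\theta')\| \le \gamma \,\|\theta - \theta'\| \quad \forall\, \theta, \theta'.
$$

A standard consequence of $\gamma$-smoothness, commonly used in convergence proofs of this kind, is the quadratic upper bound

$$
\ell(\theta') \le \ell(\theta) + \langle \nabla \ell(\theta),\, \theta' - \theta \rangle + \frac{\gamma}{2}\,\|\theta' - \theta\|^2 .
$$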

Figures (7)

  • Figure 1: Perplexity vs. memory consumption for a number of SOTA algorithms. Points toward the bottom-left of the plot represent a better performance/memory trade-off (see the memory-estimation appendix for details of the memory estimation).
  • Figure 2: Comparison of SGD and Adam training loss and evaluation perplexity curves on the LLaMA 130M model. SGD clearly does not converge to any reasonable level of perplexity. The Adam and SGD learning rates are 3e-3 and 0.1, respectively. We searched over multiple learning rates for SGD: with lower ones the loss decreases even more slowly, and higher ones cause training to diverge.
  • Figure 3: Histograms of the LM-head gradients after applying row-wise (a) and column-wise (b) normalization. The gradients are from the 1000th training iteration of a LLaMA 130M model. Panel (a) shows that row-wise normalization results in some very high gradient values (up to about 150 in absolute value), which we find destabilize training.
  • Figure 4: Estimated variance of the stochastic gradients (and momentum when applicable) for different layers in two methods (smoothed with a 50-iteration window). When running SGD with column-wise normalization (SGD-col-norm, left plot), the variance of the last layer (lm_head) is the largest for most of training, followed by the variance of the first layer (embedding) and then the other layers. After applying momentum to the last layer (SGD-col-norm-mmt-last, right plot), the variance of the last-layer momentum (lm_head momentum) decreases to a very low level. Interestingly, the variance of the first layer in the right plot is also smaller than in the left plot.
  • Figure 5: Learning rate sensitivity analysis, comparing Stable-SPAM (a stabilized version of Adam) and our method. Results from the 130M LLaMA model.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Theorem 2.1
  • Remark 2.1
  • Lemma A.1
  • Lemma A.2
  • Lemma A.3
  • Proposition A.1