A Minimalist Optimizer Design for LLM Pretraining
Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong
TL;DR
This work tackles the memory overhead of adaptive optimizers like Adam in LLM pretraining by asking what minimal modifications to SGD are needed to reach state-of-the-art performance. Through a bottom-up study, it identifies two high-signal components, column-wise gradient normalization and first-order momentum applied only to the last layer, which together yield the SCALE optimizer. SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory, and consistently outperforms other memory-efficient baselines across multiple LLaMA model sizes. The results establish SCALE as a practical, minimalist baseline for memory-constrained pretraining and offer design principles for future optimizer research in large-scale language modeling.
Abstract
Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory than SGD to maintain first- and second-order moments. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly memory- and compute-efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory-efficient pretraining. Across multiple LLaMA models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For the LLaMA 7B model, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in terms of both perplexity and memory consumption.
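To make the two ingredients concrete, the following is a minimal sketch in plain Python of what one SCALE-style update step might look like. It is an illustration under stated assumptions, not the authors' implementation. The names `column_normalize`, `scale_step`, `last_layer`, `lr`, `beta` and `eps` are hypothetical. The sketch assumes that each weight gradient is a 2-D matrix stored as a list of rows, that "column-wise" means dividing each column by its L2 norm, that the momentum is an exponential moving average of the gradient, and that the last layer's momentum buffer is itself column-normalized before the update.

```python
import math


def column_normalize(grad, eps=1e-8):
    """Divide each column of a 2-D gradient (list of rows) by its L2 norm.

    Assumption: "column-wise" means one norm per column of the stored matrix.
    """
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [
        [grad[i][j] / (norms[j] + eps) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def scale_step(params, grads, momentum, last_layer, lr=0.1, beta=0.9):
    """Apply one hypothetical SCALE-style update in place.

    params, grads: dict mapping layer name -> 2-D list of floats.
    momentum: dict holding a buffer for `last_layer` only (the sole state).
    """
    for name, g in grads.items():
        if name == last_layer:
            # First-order momentum (assumed EMA form), kept for this layer only.
            m = [
                [beta * m_ij + (1 - beta) * g_ij for m_ij, g_ij in zip(m_row, g_row)]
                for m_row, g_row in zip(momentum[name], g)
            ]
            momentum[name] = m
            direction = column_normalize(m)
        else:
            # All other layers: stateless, column-normalized SGD.
            direction = column_normalize(g)
        params[name] = [
            [w_ij - lr * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(params[name], direction)
        ]


# Toy usage with two layers; "lm_head" plays the role of the output layer.
params = {"hidden": [[1.0, 1.0], [1.0, 1.0]], "lm_head": [[0.0, 0.0]]}
grads = {"hidden": [[3.0, 0.0], [4.0, 2.0]], "lm_head": [[2.0, -4.0]]}
momentum = {"lm_head": [[0.0, 0.0]]}
scale_step(params, grads, momentum, last_layer="lm_head")
```

In this sketch the only optimizer state is the single momentum buffer for the output layer, which is the source of the memory savings relative to Adam that the abstract describes. Every other layer is updated from its current gradient alone.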
