Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu

TL;DR

The paper addresses the instability and learning-rate sensitivity of 4-bit training for large language models. It introduces Stable-SPAM, a stabilized spike-aware optimizer that combines adaptive gradient normalization (AdaGN), adaptive spike-aware clipping (AdaClip), and momentum reset (MoRet) to tame gradient-norm spikes. Empirically, Stable-SPAM outperforms Adam and SPAM across INT4, FP4, and BF16 settings, often matching or surpassing BF16 results with fewer training tokens. The work shows that robust low-bit optimization is feasible for large models, enabling significant memory and compute savings without sacrificing performance, and that the same approach can be applied to stabilize other optimizers.

Abstract

This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms and therefore requires careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical $\ell_2$-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to 2 perplexity points. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
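
The three components listed in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes exponential moving averages with bias correction for the tracked statistics, and it clamps spiked entries to the threshold. The names `SpikeStats`, `stabilize`, `maybe_reset_moments`, the hyperparameters `theta`, `gamma1`, `gamma2`, and their default values are all hypothetical choices made for illustration. The gradient is represented as a flat list of floats so the example runs with the standard library only.

```python
import math


class SpikeStats:
    """Running statistics for one parameter tensor (hypothetical helper)."""

    def __init__(self, theta=0.999, gamma1=0.7, gamma2=0.9, eps=1e-8):
        # Decay rates are illustrative defaults, not values taken from the paper.
        self.theta, self.gamma1, self.gamma2, self.eps = theta, gamma1, gamma2, eps
        self.step = 0
        self.max_ema = 0.0  # EMA of the per-step maximum |g|
        self.norm_m = 0.0   # EMA of the gradient l2 norm
        self.norm_v = 0.0   # EMA of the squared gradient l2 norm


def stabilize(grad, stats):
    """Apply adaptive clipping, then adaptive norm scaling, to a flat gradient."""
    stats.step += 1
    t = stats.step

    # (1) Adaptive clipping: the threshold follows the historical maxima.
    g_max = max(abs(g) for g in grad)
    stats.max_ema = stats.theta * stats.max_ema + (1 - stats.theta) * g_max
    threshold = stats.max_ema / (1 - stats.theta ** t)  # bias correction
    clipped = [math.copysign(min(abs(g), threshold), g) for g in grad]

    # (2) Adaptive normalization: rescale the whole gradient using the
    # history of its l2 norm (assumed form: unit direction * m_hat / sqrt(v_hat)).
    norm = math.sqrt(sum(g * g for g in clipped))
    stats.norm_m = stats.gamma1 * stats.norm_m + (1 - stats.gamma1) * norm
    stats.norm_v = stats.gamma2 * stats.norm_v + (1 - stats.gamma2) * norm * norm
    m_hat = stats.norm_m / (1 - stats.gamma1 ** t)
    v_hat = stats.norm_v / (1 - stats.gamma2 ** t)
    scale = m_hat / (math.sqrt(v_hat) + stats.eps) / (norm + stats.eps)
    return [g * scale for g in clipped]


def maybe_reset_moments(exp_avg, exp_avg_sq, step, interval):
    """(3) Momentum reset: zero Adam's moments every `interval` steps."""
    if interval > 0 and step % interval == 0:
        return [0.0] * len(exp_avg), [0.0] * len(exp_avg_sq), True
    return exp_avg, exp_avg_sq, False


if __name__ == "__main__":
    stats = SpikeStats()
    for _ in range(4):
        steady = stabilize([1.0, -1.0, 0.5], stats)
    spiked = stabilize([100.0, -1.0, 0.5], stats)  # one entry spikes
    print("steady:", steady)
    print("spiked:", spiked)
```

In this sketch the stabilized gradient would then be passed to an ordinary Adam update, with `maybe_reset_moments` called on Adam's moment buffers before each step. Under these assumptions a single spiked entry raises the clipping threshold only gradually, and the output norm stays close to its historical level instead of jumping with the spike.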

Paper Structure

This paper contains 16 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Performance of 4-bit LLM training. Experiments are conducted with LLaMA-130M/350M/1B models on the C4 dataset. Adam-BF16 denotes a model trained in BF16 with Adam. Perplexity on the validation set is reported.
  • Figure 1: Comparison of various optimizers for INT4 and FP4 training of LLaMA models on C4. Perplexity is reported.
  • Figure 2: Final validation loss when training LLaMA-130M on C4, sweeping across learning rates (LR). The vertical dotted line marks the learning rate beyond which the model can no longer be trained, i.e., the training loss becomes NaN. Red dashed horizontal lines indicate the best performance achieved.
  • Figure 3: Effect of SpikeClip (Huang et al., 2025) on stabilizing training. Left: gradient norms before and after gradient spike clipping. Right: training loss with and without gradient spike clipping. Models are LLaMA-130M trained on C4 with the Adam optimizer.
  • Figure 4: Training loss and gradient norm of Adam using various learning rates with BF16 and FP4 precision. Experiments are conducted under the same training configuration with LLaMA-130M/350M.
  • ...and 5 more figures