When Can You Get Away with Low Memory Adam?

Dayal Singh Kalra, John Kirchenbauer, Maissam Barkeshli, Tom Goldstein

TL;DR

The paper tackles the memory overhead of Adam by introducing SlimAdam, a low-memory variant guided by a layer-wise Signal-to-Noise Ratio (SNR) analysis that decides when second-moment entries can be replaced by their means. By computing $SNR_K(V_t) = \mathbb{E}_{K'}\left[ (\mathbb{E}_K[V_t])^2 / \mathrm{Var}_K[V_t] \right]$, the authors identify per-layer compression strategies across dimensions such as $\text{fan}_{\text{in}}$ and $\text{fan}_{\text{out}}$, and demonstrate that SlimAdam achieves Adam-like performance and stability while saving up to $98\%$ of second moments on GPT/ViT-style models. The analysis reveals architecture- and regime-dependent compressibility: token-embedding and LM-head layers resist token-dimension compression, attention keys/queries favor $\text{fan}_{\text{in}}$ compression, while values/projections favor $\text{fan}_{\text{out}}$; heavy-tailed token distributions and large learning rates can reduce compressibility, whereas initialization can boost SNR in certain layers. Empirically, SlimAdam matches Adam across diverse tasks and models, offering substantial memory savings and practical guidance for when to deploy low-memory optimizers in real-world training, with rules that transfer across dataset and model scales.
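As a rough illustration of the SNR formula above, the sketch below computes $SNR_K$ for a small second-moment matrix stored as a list of rows with shape (fan_out, fan_in). The function name `snr`, the `axis` convention (0 for fan_out, 1 for fan_in, `None` for both), and the small `eps` guard are assumptions made for this example and are not taken from the authors' code.

```python
from statistics import fmean, pvariance

def snr(v, axis, eps=1e-30):
    """SNR_K(V): average over the kept dimension K' of mean_K(V)^2 / var_K(V).

    v is a list of rows (fan_out x fan_in). axis=1 averages over fan_in,
    axis=0 averages over fan_out, axis=None averages over both.
    eps is a hypothetical guard against zero variance.
    """
    if axis == 1:
        groups = [list(row) for row in v]        # one group per output row
    elif axis == 0:
        groups = [list(col) for col in zip(*v)]  # one group per input column
    else:
        groups = [[x for row in v for x in row]] # a single group for the block
    return fmean([fmean(g) ** 2 / (pvariance(g) + eps) for g in groups])

# Toy second moments: rows differ by orders of magnitude, while entries
# within a row differ by only about 1 percent.
row_scale = [10.0 ** (-k) for k in range(4)]
wiggle = [0.99, 1.0, 1.01]
v = [[s * u for u in wiggle] for s in row_scale]

snr_fan_in = snr(v, axis=1)   # large: each row is well described by its mean
snr_fan_out = snr(v, axis=0)  # below 1: a column mean hides the row scales
```

In this toy case, averaging along fan_in loses little information while averaging along fan_out does not, which is the kind of per-dimension distinction the layer-wise analysis is meant to surface.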

Abstract

Adam is the go-to optimizer for training modern machine learning models, but it requires additional memory to maintain the moving averages of the gradients and their squares. While various low-memory optimizers have been proposed that sometimes match the performance of Adam, their lack of reliability has left Adam as the default choice. In this work, we apply a simple layer-wise Signal-to-Noise Ratio (SNR) analysis to quantify when second-moment tensors can be effectively replaced by their means across different dimensions. Our SNR analysis reveals how architecture, training hyperparameters, and dataset properties impact compressibility along Adam's trajectory, naturally leading to $\textit{SlimAdam}$, a memory-efficient Adam variant. $\textit{SlimAdam}$ compresses the second moments along dimensions with high SNR when feasible, and leaves them uncompressed when compression would be detrimental. Through experiments across a diverse set of architectures and training scenarios, we show that $\textit{SlimAdam}$ matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for $\textit{SlimAdam}$ is available at https://github.com/dayal-kalra/low-memory-adam.
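To make "replacing second moments by their means" concrete, here is a minimal sketch of one Adam step in which the second moment of a (fan_out, fan_in) weight matrix is shared along fan_in, so only one value per row is stored. Because the moving average is linear, tracking the row mean of the squared gradients is equivalent to tracking the row mean of the full second-moment matrix. The function name `slim_adam_step`, its arguments, and the use of plain Python lists are assumptions for illustration; this is not the authors' implementation, which is available at the repository linked above.

```python
import math

def slim_adam_step(w, g, m, v_row, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One in-place Adam step with the second moment averaged over fan_in.

    w, g, m are lists of rows (fan_out x fan_in); v_row holds a single
    second-moment value per row instead of fan_in values. t starts at 1.
    """
    for i, (w_i, g_i, m_i) in enumerate(zip(w, g, m)):
        # Mean of squared gradients over the compressed dimension.
        g2_mean = sum(x * x for x in g_i) / len(g_i)
        v_row[i] = b2 * v_row[i] + (1 - b2) * g2_mean
        v_hat = v_row[i] / (1 - b2 ** t)          # bias correction
        for j, g_ij in enumerate(g_i):
            m_i[j] = b1 * m_i[j] + (1 - b1) * g_ij
            m_hat = m_i[j] / (1 - b1 ** t)        # bias correction
            w_i[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# Usage on a 2 x 2 toy layer.
w = [[1.0, 2.0], [3.0, 4.0]]
g = [[0.1, -0.1], [0.2, 0.2]]
m = [[0.0, 0.0], [0.0, 0.0]]
v_row = [0.0, 0.0]   # 2 stored values instead of 4
slim_adam_step(w, g, m, v_row, t=1)
```

For a layer with fan_in columns, this layout stores fan_out second-moment values rather than fan_out times fan_in. The first moments are left uncompressed, since the savings reported in the abstract refer to second moments.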

Paper Structure

This paper contains 48 sections, 4 equations, 32 figures, 3 tables.

Figures (32)

  • Figure 1: Comparison of common low-memory optimizers on GPT pre-training task using Fineweb-Edu dataset. SlimAdam matches Adam's performance with a nearly identical U-shaped loss curve.
  • Figure 2: SNR trajectories of selected second-moment blocks of GPT-small model trained on OpenWebText. Different compression dimensions are denoted as: $K = 0$ for $\text{fan}_{\text{out}}$, $K = 1$ for $\text{fan}_{\text{in}}$, and $K = (0, 1)$ for both dimensions.
  • Figure 3: Depth dependence of average SNR values for different second-moment blocks of the GPT-small model trained on OpenWebText. The experimental setup is the same as in Figure 2.
  • Figure 4: SNR trends for selected layers of pre-trained Llama 3.2 1B fine-tuned on Alpaca dataset. For detailed results, see the paper's appendix on SNR during fine-tuning.
  • Figure 5: SNR trends of ResNet-18 trained on CIFAR-100: (left) layer dependence of averaged SNR values on the intermediate convolutional layers, (right) SNR trajectories of the final layer.
  • ...and 27 more figures