Adam-mini: Use Fewer Learning Rates To Gain More

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun

TL;DR

This paper tackles the high memory cost of Adam-based optimizers in training large language models by introducing Adam-mini, a Hessian-structure-aware optimizer. Adam-mini partitions parameters into blocks aligned with the smallest dense Hessian sub-blocks and assigns each block a single learning rate computed from the block's average v. This removes over 99.9% of the entries in v while matching or improving on the performance of AdamW. Experiments on GPT-2 and Llama models, covering pre-training, SFT, and RLHF, show results on par with or better than AdamW at about 50% lower memory footprint, with throughput up to 49.6% higher on 2× A800-80GB GPUs. The work highlights the value of leveraging Hessian structure for memory-efficient optimization and outlines avenues for refining blockwise learning-rate design and for broader applicability beyond LLMs.
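One way to see where a figure of about 50% can come from is to count optimizer state. The sketch below assumes the optimizer keeps two buffers, m and v, in 32-bit floats, and that Adam-mini keeps one v value per block. The parameter count (a round 7 billion) and the block count are hypothetical values chosen so that 99.9% of v is removed; they are not taken from the paper.

```python
def optimizer_state_bytes(n_params, n_blocks=None, bytes_per_value=4):
    """Bytes used by m and v; v has one entry per block if n_blocks is given."""
    m_bytes = n_params * bytes_per_value
    n_v_entries = n_params if n_blocks is None else n_blocks
    v_bytes = n_v_entries * bytes_per_value
    return m_bytes + v_bytes

n_params = 7_000_000_000          # hypothetical round number for a 7B model
n_blocks = n_params // 1000       # hypothetical: keeps 0.1% of the v entries

adamw_bytes = optimizer_state_bytes(n_params)            # full m and full v
mini_bytes = optimizer_state_bytes(n_params, n_blocks)   # full m, blockwise v
saving = 1 - mini_bytes / adamw_bytes

print(f"AdamW state:     {adamw_bytes / 1e9:.1f} GB")
print(f"Adam-mini state: {mini_bytes / 1e9:.1f} GB")
print(f"Saving:          {saving:.2%}")
```

Under these assumptions, removing nearly all of v leaves m as the dominant buffer, so the optimizer state shrinks by just under half.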

Abstract

We propose Adam-mini, an optimizer that achieves on par or better performance than AdamW with 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). By investigating the Hessian structure of neural nets, we find Adam's $v$ might not function at its full potential as effectively as we expected. We find that $\geq$ 99.9% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our new principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B on $2\times$ A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
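The idea of one learning rate per block can be illustrated with a minimal NumPy sketch. It assumes an Adam-style update with bias correction and omits weight decay; the function name `adam_mini_step` and the partition `blocks` (a list of slices over a flat parameter vector) are hypothetical. The paper's actual partition follows its Hessian-structure principle, which is not reproduced here.

```python
import numpy as np

def adam_mini_step(params, grads, m, v_blocks, blocks, t,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step: per-parameter momentum m, but a single scalar v per block."""
    for b, idx in enumerate(blocks):
        g = grads[idx]
        # First moment is kept per parameter, as in Adam.
        m[idx] = beta1 * m[idx] + (1.0 - beta1) * g
        # Second moment is a scalar: the mean of g*g over the block.
        v_blocks[b] = beta2 * v_blocks[b] + (1.0 - beta2) * np.mean(g * g)
        # Bias correction (an assumption of this sketch).
        m_hat = m[idx] / (1.0 - beta1 ** t)
        v_hat = v_blocks[b] / (1.0 - beta2 ** t)
        # Every parameter in the block shares the same step-size scaling.
        params[idx] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v_blocks

# Toy usage: 6 parameters split into 2 hypothetical blocks of 3.
params = np.ones(6)
grads = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])
m = np.zeros(6)
blocks = [slice(0, 3), slice(3, 6)]
v_blocks = np.zeros(len(blocks))
params, m, v_blocks = adam_mini_step(params, grads, m, v_blocks, blocks, t=1)
```

Note that `v_blocks` holds one value per block rather than one per parameter, which is where the memory saving comes from in this sketch.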

Paper Structure

This paper contains 47 sections, 4 equations, 24 figures, 9 tables, 7 algorithms.

Figures (24)

  • Figure 1: Results for Llama 2-7B pre-training. (a) Adam-mini takes less memory and can reach higher throughput (number of tokens per second). The throughput is tested on 2$\times$ A800-80GB GPUs. (b, c) Adam-mini performs on par with AdamW, but takes 33% less time to process the same number of tokens.
  • Figure 2: An illustration of Adam-mini. Adam-mini assigns learning rates (lrs) by Hessian structure. It uses more lrs than SGD but fewer than Adam.
  • Figure 3: The near-block-diagonal Hessian structure of neural nets. (a): The Hessian of an MLP after 1 training step, as reported in Collobert (2004). (b, c, d): The Hessians of a 1-hidden-layer MLP on CIFAR-100. The near-block-diagonal structure is maintained throughout training, where each block corresponds to one neuron.
  • Figure 4: (a): The Hessian of a three-block random quadratic problem. (b): Training curves for the problem associated with the full Hessian in (a). The optimal single (blockwise) learning rate is chosen based on the full (blockwise) Hessian in (a). (c): The 1st dense Hessian sub-blocks in (a). (d): Training curves for the new problem associated with the Hessian in (c).
  • Figure 5: The effectiveness of Adam's preconditioner $D_{\text{Adam}}$ on different matrix structures of $H_b$. (a): For most dimensions $d$, $r$ is large when $\tau$ is small ($r$ and $\tau$ are defined in Eq. (\ref{eq_off_diag_ratio})). This indicates that Adam might not be so effective when $H_b$ is dense. We fix $\kappa(H_b) = 500$ here. (b): We use the same setup as (a), except that we fix the dimension $d = 50$ and change the $x$-axis to $\kappa(H_b)$.
  • ...and 19 more figures
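
The comparison described in the Figure 4 caption, a single learning rate versus blockwise learning rates on a three-block quadratic, can be imitated with a small experiment. The sketch below is not the paper's setup: it assumes plain gradient descent, an exactly block-diagonal Hessian, and illustrative eigenvalue ranges per block. For each case it uses the classical step size 2 / (λ_min + λ_max), computed from the full Hessian or from each block.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd_block(dim, eig_min, eig_max):
    """Dense symmetric positive-definite block with eigenvalues in [eig_min, eig_max]."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    eigs = np.linspace(eig_min, eig_max, dim)
    return q @ np.diag(eigs) @ q.T

dim = 5
eig_ranges = [(1.0, 2.0), (10.0, 20.0), (100.0, 200.0)]  # illustrative choice
n = dim * len(eig_ranges)

# Assemble a block-diagonal Hessian from three dense blocks.
H = np.zeros((n, n))
blocks = []
for i, (lo, hi) in enumerate(eig_ranges):
    idx = slice(i * dim, (i + 1) * dim)
    H[idx, idx] = random_spd_block(dim, lo, hi)
    blocks.append((idx, lo, hi))

def loss(x):
    return 0.5 * x @ H @ x

def run_gd(lr_vector, steps=100):
    """Gradient descent on 0.5 * x^T H x with an elementwise learning rate."""
    x = np.ones(n)
    for _ in range(steps):
        x = x - lr_vector * (H @ x)
    return loss(x)

# Single learning rate from the extreme eigenvalues of the full Hessian.
single_lr = np.full(n, 2.0 / (1.0 + 200.0))

# One learning rate per block from that block's extreme eigenvalues.
block_lr = np.empty(n)
for idx, lo, hi in blocks:
    block_lr[idx] = 2.0 / (lo + hi)

loss_single = run_gd(single_lr)
loss_block = run_gd(block_lr)
print(f"single lr:    {loss_single:.3e}")
print(f"blockwise lr: {loss_block:.3e}")
```

In this toy problem each block is well conditioned while the full Hessian is not, so the blockwise learning rates drive the loss down far faster than the single one.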