Adam-mini: Use Fewer Learning Rates To Gain More

Yushun Zhang; Congliang Chen; Ziniu Li; Tian Ding; Chenwei Wu; Diederik P. Kingma; Yinyu Ye; Zhi-Quan Luo; Ruoyu Sun

TL;DR

This paper tackles the high memory cost of Adam-style optimizers in large language model training by introducing Adam-mini, an optimizer that exploits Hessian structure. Adam-mini partitions the parameters into blocks aligned with the smallest dense sub-blocks of the Hessian and assigns each block a single learning rate computed from the block’s average of $v$. This removes over 99.9% of the entries in $v$ while matching or improving on the performance of AdamW. Experiments on GPT-2 and Llama models, covering pre-training, SFT, and RLHF, show on-par or better results with about 50% less memory, and 49.6% higher throughput than AdamW when pre-training Llama 2-7B on 2× A800-80GB GPUs. The work highlights the value of Hessian structure for memory-efficient optimization and outlines two avenues for future work: refining the blockwise learning-rate design and extending the method beyond LLMs.
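The blockwise update summarized above can be sketched in a few lines. This is a simplified illustration rather than the authors' full algorithm: the partition into blocks is taken as given (the paper's Hessian-based partition rule is not reproduced), weight decay is omitted, and the function name `adam_mini_step` and its arguments are hypothetical. The one point it is meant to show is that the second-moment state `v` holds one scalar per block instead of one entry per parameter.

```python
import math

def adam_mini_step(params, grads, m, v, step,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one simplified blockwise update in place (illustrative sketch).

    params, grads, m: lists of blocks, each block a list of floats.
    v: list holding ONE float per block (not one per parameter).
    step: 1-based step counter used for bias correction.
    """
    for b, grad_block in enumerate(grads):
        # Single second-moment statistic for the whole block:
        # an EMA of the mean squared gradient over the block.
        mean_sq = sum(g * g for g in grad_block) / len(grad_block)
        v[b] = beta2 * v[b] + (1.0 - beta2) * mean_sq
        v_hat = v[b] / (1.0 - beta2 ** step)
        denom = math.sqrt(v_hat) + eps

        for i, g in enumerate(grad_block):
            # The first moment stays per-parameter, as in Adam.
            m[b][i] = beta1 * m[b][i] + (1.0 - beta1) * g
            m_hat = m[b][i] / (1.0 - beta1 ** step)
            # Every parameter in the block shares the same denominator.
            params[b][i] -= lr * m_hat / denom
    return params

# Usage: two blocks with five parameters in total, but only two entries in v.
params = [[1.0, 1.0, 1.0], [1.0, 1.0]]
grads = [[1.0, 2.0, 3.0], [0.5, 0.5]]
m = [[0.0, 0.0, 0.0], [0.0, 0.0]]
v = [0.0, 0.0]
adam_mini_step(params, grads, m, v, step=1, lr=0.1)
```

In this toy setup the second block has identical gradients, so its shared denominator equals the gradient magnitude and each parameter moves by roughly `lr`, as it would under plain Adam.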

Abstract

We propose Adam-mini, an optimizer that achieves performance on par with or better than AdamW with a 50% smaller memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). By investigating the Hessian structure of neural nets, we find that Adam's $v$ might not function at its full potential. We find that $\geq$ 99.9% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our new principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par with or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B on $2\times$ A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
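The two headline numbers in the abstract follow from short arithmetic, sketched below. The helper names are hypothetical, and the memory calculation assumes that the footprint in question is the optimizer state, with Adam's $m$ and $v$ stored as two equally sized buffers.

```python
def time_saved_fraction(throughput_gain):
    """Wall-clock fraction saved on a fixed token budget.

    A relative throughput gain of g means the same tokens take 1 / (1 + g)
    of the original time.
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)

def optimizer_state_ratio(v_removed_fraction):
    """Optimizer state size relative to Adam, assuming m and v are equal-sized.

    m is kept in full; only the surviving fraction of v is stored.
    """
    return (1.0 + (1.0 - v_removed_fraction)) / 2.0

print(round(time_saved_fraction(0.496), 3))    # 0.332, i.e. about 33% less time
print(round(optimizer_state_ratio(0.999), 4))  # 0.5005, i.e. about 50% less state
```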

Paper Structure (47 sections, 4 equations, 24 figures, 9 tables, 7 algorithms)

Introduction
Method
Motivations and Observations
Proposed Method: Adam-mini
Principle for the Partition Strategy
Some Characteristics of Adam-mini and Discussions
Experiments
Pre-training
Scaling Laws of Adam-mini
Supervised Fine-tuning and RLHF
Detailed Comparison with Adafactor
Concluding Remarks
Related works
The Complete Form of Adam-mini
More Discussions
...and 32 more sections

Figures (24)

Figure 1: Results for Llama 2-7B pre-training. (a) Adam-mini takes less memory and can reach higher throughput (tokens per second). The throughput is tested on 2$\times$ A800-80GB GPUs. (b, c) Adam-mini performs on par with AdamW, but takes 33% less time to process the same number of tokens.
Figure 2: An illustration of Adam-mini. Adam-mini assigns learning rates (lrs) by Hessian structure. It uses more lrs than SGD but fewer than Adam.
Figure 3: The near-block-diagonal Hessian structure of neural nets. (a) The Hessian of an MLP after 1 training step, as reported by Collobert (2004). (b, c, d) The Hessians of a 1-hidden-layer MLP on CIFAR-100. The near-block-diagonal structure is maintained throughout training, with each block corresponding to one neuron.
Figure 4: (a): The Hessian of a three-block random quadratic problem. (b): Training curves for the problem associated with the full Hessian in (a). The optimal single (blockwise) learning rate is chosen based on the full (blockwise) Hessian in (a). (c): The 1st dense Hessian sub-blocks in (a). (d): Training curves for the new problem associated with the Hessian in (c).
Figure 5: The effectiveness of Adam's preconditioner $D_{\text{Adam}}$ on different matrix structures of $H_b$. (a): For most dimensions $d$, $r$ is large when $\tau$ is small ($r$ and $\tau$ are defined in the paper's equation labeled eq_off_diag_ratio). This indicates that Adam might not be so effective when $H_b$ is dense. We fix $\kappa(H_b) = 500$ here. (b): The same setup as (a), except that we fix the dimension $d = 50$ and change the $x$-axis to $\kappa(H_b)$.
...and 19 more figures
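
The Figure 4 caption contrasts a single learning rate chosen from the full Hessian with blockwise learning rates chosen from each block. The sketch below is a deterministic toy version of that comparison and does not reproduce the figure. Everything in it is assumed for illustration: the three dense 2x2 blocks and their scales, the starting point, and the use of plain gradient descent with the classical step size $2/(\lambda_{\min}+\lambda_{\max})$.

```python
import math

# Hypothetical toy Hessian: three dense 2x2 blocks on the diagonal with very
# different scales. All values are assumptions chosen for illustration.
BLOCKS = [[[2.0 * s, 1.0 * s], [1.0 * s, 2.0 * s]] for s in (1.0, 10.0, 100.0)]

def eig_range(block):
    """Smallest and largest eigenvalue of a symmetric 2x2 matrix."""
    (a, b), (_, c) = block
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return mid - rad, mid + rad

def optimal_lr(lam_min, lam_max):
    """Classical optimal constant step for gradient descent on a quadratic."""
    return 2.0 / (lam_min + lam_max)

def matvec(block, x):
    """Product of a 2x2 block with a length-2 vector."""
    return [block[0][0] * x[0] + block[0][1] * x[1],
            block[1][0] * x[0] + block[1][1] * x[1]]

def run_gd(lrs, steps=50):
    """Run GD on f(x) = 0.5 * x^T H x with one lr per block; return final loss.

    H is block-diagonal, so each block can be iterated independently.
    """
    loss = 0.0
    for block, lr in zip(BLOCKS, lrs):
        x = [1.0, 0.0]
        for _ in range(steps):
            g = matvec(block, x)
            x = [x[0] - lr * g[0], x[1] - lr * g[1]]
        hx = matvec(block, x)
        loss += 0.5 * (x[0] * hx[0] + x[1] * hx[1])
    return loss

ranges = [eig_range(b) for b in BLOCKS]
# The spectrum of a block-diagonal matrix is the union of the block spectra.
global_lr = optimal_lr(min(r[0] for r in ranges), max(r[1] for r in ranges))

loss_single = run_gd([global_lr] * len(BLOCKS))
loss_blockwise = run_gd([optimal_lr(*r) for r in ranges])
print(loss_single, loss_blockwise)
```

In this toy the single learning rate is capped by the largest eigenvalue of the whole matrix, so the small-scale block barely moves, while each blockwise rate only has to cope with its own block's condition number.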
