APOLLO: SGD-like Memory, AdamW-level Performance

Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee

TL;DR

APOLLO tackles the memory bottleneck of AdamW in large language model training by coarsening AdamW's learning rate adaptation into a structured learning rate update that needs no costly SVD. It keeps an auxiliary low-rank optimizer state, built with pure random projection, to approximate channel-wise gradient scaling, which sharply reduces optimizer memory while maintaining or improving training perplexity and downstream performance. The savings yield system-level gains, including up to 3x throughput and the ability to pre-train or fine-tune large models on modest hardware, especially when combined with weight quantization. Overall, APOLLO and its rank-1 variant, APOLLO-Mini, offer a practical, scalable path to memory-efficient LLM optimization with SGD-level memory cost, broadening access to large-scale pre-training and fine-tuning.

Abstract

Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
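
The idea described in the abstract, a low-rank auxiliary optimizer state obtained by random projection and used only to derive a structured scaling of the raw gradient, can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `apollo_like_step`, the choice of one scaling factor per column as the "channel", the norm-ratio form of that factor, and all hyperparameter defaults are assumptions made here for illustration.

```python
import numpy as np

def apollo_like_step(W, G, state, lr=1e-3, rank=4,
                     beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """One illustrative update of W (m x n) given its gradient G (m x n).

    Assumption: Adam-style moments are kept only for the projected
    gradient R = P @ G (rank x n), and one scaling factor per column
    is derived from them and applied to the full-rank gradient.
    """
    m, n = G.shape
    if "P" not in state:
        rng = np.random.default_rng(seed)
        # Pure random projection: no SVD is involved.
        state["P"] = rng.standard_normal((rank, m)) / np.sqrt(rank)
        state["M"] = np.zeros((rank, n))  # first moment, low-rank space
        state["V"] = np.zeros((rank, n))  # second moment, low-rank space
        state["t"] = 0
    state["t"] += 1

    R = state["P"] @ G                                   # (rank, n)
    state["M"] = beta1 * state["M"] + (1 - beta1) * R
    state["V"] = beta2 * state["V"] + (1 - beta2) * R**2
    m_hat = state["M"] / (1 - beta1 ** state["t"])       # bias correction
    v_hat = state["V"] / (1 - beta2 ** state["t"])
    R_tilde = m_hat / (np.sqrt(v_hat) + eps)             # Adam-style update in low-rank space

    # Assumed channel-wise factor: how much the Adam-style rule rescaled
    # each column of the projected gradient.
    scale = np.linalg.norm(R_tilde, axis=0) / (np.linalg.norm(R, axis=0) + eps)

    # Structured update: the raw gradient, rescaled column by column.
    return W - lr * G * scale
```

In this sketch the optimizer state holds `2 * rank * n + rank * m` numbers, compared with `2 * m * n` for the two full-size moments of AdamW, which is where the memory saving comes from. Setting `rank=1` gives the smallest possible state in this sketch; whether that matches APOLLO-Mini in detail is not something the abstract specifies.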

Paper Structure

This paper contains 71 sections, 6 theorems, 68 equations, 9 figures, 12 tables, 1 algorithm.

Key Result

Theorem 4.1

Approximated Channel-wise Momentum with a bound for its $\ell_2$ norm: $\textbf{G}_t \in {\mathbb{R}}^{m \times n}$ is the full-rank gradient ($m \leq n$). Let $\textbf{P}$ be a matrix of shape ${\mathbb{R}}^{r \times m}$ where each element is independently sampled from a standard Gaussian distribut
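
The theorem statement is cut off above, so the following is not a check of its bound. It only illustrates a standard fact about the setup it describes: for a matrix $\textbf{P} \in {\mathbb{R}}^{r \times m}$ with i.i.d. standard Gaussian entries, $\mathbb{E}\,\|\textbf{P}x\|^2 = r\,\|x\|^2$, so $\|\textbf{P}x\| / \sqrt{r}$ concentrates around $\|x\|$ as $r$ grows. The helper name `projected_norm_ratio` and the test vector are hypothetical.

```python
import numpy as np

def projected_norm_ratio(m=256, r=64, trials=200, seed=0):
    """Mean and std of ||P x|| / (sqrt(r) * ||x||) over random Gaussian P (r x m)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(m)  # stands in for one gradient channel (made-up data)
    ratios = []
    for _ in range(trials):
        P = rng.standard_normal((r, m))  # i.i.d. standard Gaussian entries
        ratios.append(np.linalg.norm(P @ x) / (np.sqrt(r) * np.linalg.norm(x)))
    return float(np.mean(ratios)), float(np.std(ratios))

if __name__ == "__main__":
    for r in (1, 8, 64):
        mean, std = projected_norm_ratio(r=r)
        print(f"r={r:3d}  mean ratio={mean:.3f}  std={std:.3f}")
```

The spread of the ratio shrinks as `r` increases, which is the sense in which a random projection preserves norms without any SVD.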

Figures (9)

  • Figure 1: (Left) Overview of our APOLLO optimizer; (Middle) Memory breakdown comparison for a single batch size, where both GaLore and our method employ the layer-wise gradient update strategy [lv2023full]. The (Q-) prefix indicates the integration of INT8 weight quantization, as utilized in [zhang2024q]; (Right) End-to-end training throughput on 8 A100-80GB GPUs.
  • Figure 2: Comparison of validation perplexity on LLaMA-7B.
  • Figure 3: Training loss comparison between Element-wise and Channel-wise Learning Rate (LR) Adaptations with or without norm limiter (NL) on the LLaMA-130M model.
  • Figure 4: Visualization of the channel-wise scaling factor ratio for APOLLO with rank $n/8$ and $n/4$, compared with AdamW (full rank $n$). The empirical data aligns well with the theoretical ratios $1 : \sqrt{2} : 2\sqrt{2}$, validating the bounds across various layer types and stages on the LLaMA-350M model. More visualizations can be found in Fig. \ref{fig:scaling_factor_comparison}.
  • Figure 5: (a-c) Comparison results of various optimization methods using singular value decomposition or random projection. The experiments were conducted on LLaMA-60M/130M/350M models for C4 pretraining tasks. (d) Validation perplexity with varying rank sizes, where 128 is one-quarter of the original model dimension. The red dashed line indicates the performance of full-rank AdamW.
  • ...and 4 more figures
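
The ratios quoted in the Figure 4 caption are what one obtains if the channel-wise scaling factor grows with the square root of the rank. That dependence is an assumption here, read off from the caption rather than from the truncated theorem statement. Under it, ranks $n/8$, $n/4$ and $n$ give:

$$
\sqrt{\tfrac{n}{8}} : \sqrt{\tfrac{n}{4}} : \sqrt{n}
\;=\; \tfrac{1}{2\sqrt{2}} : \tfrac{1}{2} : 1
\;=\; 1 : \sqrt{2} : 2\sqrt{2}.
$$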

Theorems & Definitions (10)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem 1.1: Norm Preservation
  • proof
  • Theorem 1.2: First Moment Preservation
  • proof
  • Theorem 1.3: Second Moment Preservation
  • proof
  • Theorem 1.4: Main Result
  • proof