Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees

Thien Hang Nguyen, Huy Le Nguyen

TL;DR

This work tackles the memory bottlenecks of adaptive optimizers in large-scale neural networks by introducing Subset-Norm (SN) and Subspace-Momentum (SM). SN reduces AdaGrad-style memory from $O(d)$ to $O(\sqrt{d})$ by sharing step sizes across parameter subsets, with a high-probability convergence guarantee under coordinate-wise sub-Gaussian noise. SM confines momentum to a low-dimensional subspace of dimension $k$ while performing SGD in the orthogonal complement, with a convergence guarantee similar to that of SGD. The combination, SNSM, further reduces memory to roughly $k+\sqrt{d}$ and yields practical gains in LLM pretraining and fine-tuning, including substantial memory savings and faster attainment of competitive perplexities. Theoretical results are complemented by extensive experiments on LLaMA-scale models, showing that SN and SM not only save memory but can also improve training efficiency and perplexity with minimal hyperparameter tuning. Overall, the paper provides a principled, convergent framework for memory-efficient optimization in deep learning, with strong empirical support in real-world large-scale training settings.
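To make the step-size sharing concrete, here is a minimal NumPy sketch of an AdaGrad-style update that keeps one accumulator per subset of parameters. It is an illustration under stated assumptions, not the paper's reference implementation: the function name `adagrad_subset_norm_step`, the contiguous equal-size partition, and the `eps` stabilizer are all hypothetical choices made for this example.

```python
import numpy as np

def adagrad_subset_norm_step(x, grad, acc, lr=0.1, subset_size=4, eps=1e-8):
    """One AdaGrad-style step with a single shared step size per subset.

    x, grad : flat arrays of length d
    acc     : array of length d // subset_size holding the running sum of
              squared gradient norms of each subset (updated in place)
    Assumption: subsets are contiguous, equal-size blocks of coordinates.
    """
    d = x.size
    if d % subset_size != 0:
        raise ValueError("this sketch assumes subset_size divides d")
    g = grad.reshape(-1, subset_size)        # row j holds the j-th subset
    acc += np.sum(g * g, axis=1)             # accumulate squared subset norms
    step = lr / (np.sqrt(acc) + eps)         # one step size per subset
    return x - (step[:, None] * g).reshape(x.shape)


d = 16
subset_size = int(np.sqrt(d))                # subsets of size sqrt(d)
acc = np.zeros(d // subset_size)             # sqrt(d) state entries instead of d
x = np.zeros(d)
x = adagrad_subset_norm_step(x, np.ones(d), acc, lr=1.0, subset_size=subset_size)
```

With subsets of size $\sqrt{d}$ the accumulator has $d/\sqrt{d}=\sqrt{d}$ entries, which is where the $O(\sqrt{d})$ figure comes from. In this sketch, a subset size of 1 gives a per-coordinate AdaGrad update and a subset size of $d$ gives a single shared step size as in AdaGrad-Norm.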

Abstract

We introduce two complementary techniques for efficient optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) through step-size sharing. Subset-Norm (SN) reduces AdaGrad's memory footprint from $O(d)$ to $O(\sqrt{d})$, where $d$ is the model size. For non-convex smooth objectives under coordinate-wise sub-Gaussian noise, we show a noise-adapted high-probability convergence guarantee with improved dimensional dependence of SN over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state's memory footprint by restricting momentum to a low-dimensional subspace while performing SGD in the orthogonal complement. We prove a high-probability convergence result for Subspace-Momentum under standard assumptions. Empirical evaluation on pre-training and fine-tuning LLMs demonstrates the effectiveness of our methods. For instance, combining Subset-Norm with Subspace-Momentum achieves Adam's validation perplexity for LLaMA 1B in approximately half the training tokens (6.8B vs 13.1B) while reducing Adam's optimizer-states memory footprint by more than 80% with minimal additional hyperparameter tuning.
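The Subspace-Momentum idea described above (momentum inside a low-dimensional subspace, plain SGD in the orthogonal complement) can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' algorithm as published: the function name `subspace_momentum_step`, the exponential-moving-average form of the momentum, the shared learning rate for both components, and the random orthonormal basis `U` are assumptions made for this example.

```python
import numpy as np

def subspace_momentum_step(x, grad, m, U, lr=0.1, beta=0.9):
    """One step with momentum restricted to the column space of U.

    x, grad : flat arrays of length d
    m       : momentum state of length k (the only momentum memory kept)
    U       : (d, k) matrix with orthonormal columns spanning the subspace
    """
    g_sub = U.T @ grad                   # gradient coordinates in the subspace
    g_perp = grad - U @ g_sub            # component in the orthogonal complement
    m = beta * m + (1.0 - beta) * g_sub  # momentum is tracked only in k dims
    update = U @ m + g_perp              # momentum inside, plain SGD outside
    return x - lr * update, m


rng = np.random.default_rng(0)
d, k = 8, 2
# Assumption: a random orthonormal basis stands in for the projection.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
x, m = np.zeros(d), np.zeros(k)
x, m = subspace_momentum_step(x, rng.standard_normal(d), m, U)
```

The momentum state has $k$ entries rather than $d$, which matches the memory reduction claimed for the momentum term. Setting `beta=0` in this sketch reduces the step to plain SGD, since the subspace and complement components then sum back to the full gradient.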

Paper Structure

This paper contains 67 sections, 7 theorems, 106 equations, 18 figures, 18 tables, and 6 algorithms.

Key Result

Theorem 3.1

Suppose that $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is $L$-smooth and lower bounded by $f_{*}$. Given unbiased stochastic gradients $\widehat{\nabla}f(x_{t})$ with stochastic gradient noise $\xi_{t}:=\widehat{\nabla}f(x_{t})-\nabla f(x_{t})$ that is $\sigma_{i}$-per-coordinate subgaussian for $i\in

Figures (18)

  • Figure 1: Validation perplexity for Adam, GaLore [zhao2024galore], AdamSN, and AdamSNSM (ours) during LLaMA 1B model training for 13.1B tokens (100K steps). Optimizer memory footprint is shown in parentheses. Adam achieves a perplexity of 16.00 at 100,000 steps, while AdamSN and AdamSNSM exhibit lower perplexity earlier in training at 58,000 and 48,000 steps.
  • Figure 2: AdaGrad variants: Coordinate, Subset-Norm, and Norm. Subset-Norm generalizes Coordinate ($k=1$) and Norm ($k=d$).
  • Figure 3: Noise density per parameter across layers for LLaMA 60M on pre-training task after 100 steps.
  • Figure 4: Subspace Momentum Illustration.
  • Figure 5: Subset size ablation for AdamSN on LLaMA 60M trained for 1.38B tokens (batch size of 512 of max length 256 for 10,000 steps). The higher the subset size, the smaller the memory footprint of the second moment optimizer state.
  • ...and 13 more figures

Theorems & Definitions (13)

  • Theorem 3.1
  • Theorem 4.1
  • Lemma 4.1
  • Remark 4.2
  • proof
  • Lemma 4.3: Lemma A.1 from [liu2023high]
  • Corollary 4.4
  • proof
  • Lemma 4.5
  • proof
  • ...and 3 more