LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh

TL;DR

LDAdam addresses the memory bottleneck of adaptive optimizers by performing optimization steps in a low-dimensional gradient subspace while ensuring exploration of the full parameter space. It introduces a projection-aware update rule to transfer momentum across changing subspaces, a block PowerSGD-based subspace estimator, and a generalized error feedback mechanism to compensate for projection-induced loss, all with convergence guarantees under standard assumptions. The approach achieves comparable accuracy to Adam with a fraction of the optimizer-state memory and outperforms GaLore under similar memory budgets across RoBERTa/GSM8K/C4-Llama tasks, while maintaining favorable runtime. This work highlights memory-efficient optimization as a viable path for scalable training of large models without sacrificing theoretical guarantees or practical performance, with potential extensions to distributed settings and further memory-reduction techniques.

Abstract

We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models. Code is available at https://github.com/IST-DASLab/LDAdam
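The three ingredients named in the abstract (adaptive steps in a low-dimensional subspace, transfer of optimizer states between subspaces, and error feedback) can be illustrated with a small NumPy sketch. This is not the reference implementation from the linked repository. The class name `LowRankAdamSketch`, the helper `run_demo`, the single warm-started power-iteration step, the `(R ** 2) @ v` rule for carrying over the second moment, and the gradient-only error buffer are all simplifying assumptions made here for illustration; the paper's actual update rules may differ.

```python
import numpy as np


class LowRankAdamSketch:
    """Illustrative low-rank Adam for one (m x n) matrix; NOT the reference LDAdam code."""

    def __init__(self, shape, rank, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
        m, n = shape
        rng = np.random.default_rng(seed)
        # Orthonormal basis of the current subspace (m x rank), random at start.
        self.P, _ = np.linalg.qr(rng.standard_normal((m, rank)))
        # Optimizer states live in the low-dimensional space: rank x n instead of m x n.
        self.m1 = np.zeros((rank, n))
        self.v = np.zeros((rank, n))
        # Error buffer holding what earlier projections discarded (full size in this sketch).
        self.err = np.zeros((m, n))
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0

    def step(self, W, grad):
        self.t += 1
        # Error feedback: add back the part of earlier gradients lost to projection.
        a = grad + self.err
        # Subspace update (assumption): one power-iteration step warm-started from
        # the previous basis, then re-orthonormalized with a reduced QR.
        P_new, _ = np.linalg.qr(a @ (a.T @ self.P))
        # Move the states from the old basis to the new one.
        R = P_new.T @ self.P                 # rank x rank change of basis
        self.m1 = R @ self.m1                # first moment is linear in the gradient
        self.v = (R ** 2) @ self.v           # assumption: keeps v non-negative
        # Standard Adam moment updates on the projected gradient.
        g_low = P_new.T @ a
        self.m1 = self.beta1 * self.m1 + (1 - self.beta1) * g_low
        self.v = self.beta2 * self.v + (1 - self.beta2) * g_low ** 2
        m_hat = self.m1 / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Map the low-dimensional update back to the full parameter space.
        W_new = W - self.lr * (P_new @ (m_hat / (np.sqrt(v_hat) + self.eps)))
        # Store the projection residual for the next step.
        self.err = a - P_new @ g_low
        self.P = P_new
        return W_new


def run_demo(steps=50, seed=0):
    """Toy problem (hypothetical): minimize 0.5 * ||W - target||_F^2 with rank-2 states."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((8, 6))
    W = np.zeros_like(target)
    opt = LowRankAdamSketch(W.shape, rank=2, lr=0.05)
    losses = [0.5 * np.sum((W - target) ** 2)]
    for _ in range(steps):
        W = opt.step(W, W - target)          # gradient of the quadratic is W - target
        losses.append(0.5 * np.sum((W - target) ** 2))
    return losses, opt
```

In this toy setting the moment estimates occupy `rank x n` entries rather than `m x n`, which is where the optimizer-state saving comes from. The error buffer here is kept at full size purely for readability, so the sketch does not by itself demonstrate the memory footprint reported in the paper.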

Paper Structure

This paper contains 46 sections, 8 theorems, 86 equations, 5 figures, 12 tables, 3 algorithms.

Key Result

Theorem 1

Let Assumptions ass:smooth, ass:boundgrad and ass:var hold. Then, choosing step-size $\eta = \min(\frac{\epsilon}{4LC_0\sqrt{1+C_2}}, \frac{1}{\sqrt{T}})$, LDAdam (Algorithm alg:LDAdam_both) satisfies with constants $C_0\mathrel{\mathop:}= \sqrt{\frac{1+\beta_2}{1-\beta_2}\frac{(1 - \beta_1(1-q_r))^2}{(1-\beta_1)^2(1-q_r)^2}G^2 + \epsilon}$ and $C_2 = \frac{\beta_1 + (1-\beta_1)q_r^2}{(1-\beta_1)
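The step-size rule and the constant $C_0$ quoted above can be evaluated numerically with a short helper. The function names are hypothetical, and the comments reading $L$ as a smoothness constant, $G$ as a gradient bound and $q_r$ as a projection contraction factor are assumptions inferred from the assumption labels, since the statement does not define them here. The expression for $C_2$ is cut off in the statement, so it is taken as an input rather than computed.

```python
import math


def ldadam_c0(beta1, beta2, q_r, G, eps):
    """C_0 as written in the theorem statement.

    Assumed readings (not defined in the excerpt): G is a gradient bound and
    q_r in [0, 1) is the contraction factor of the rank-r projection.
    """
    ratio = (1 - beta1 * (1 - q_r)) ** 2 / ((1 - beta1) ** 2 * (1 - q_r) ** 2)
    return math.sqrt((1 + beta2) / (1 - beta2) * ratio * G ** 2 + eps)


def ldadam_step_size(eps, L, C0, C2, T):
    """eta = min(eps / (4 L C_0 sqrt(1 + C_2)), 1 / sqrt(T)).

    C2 must be supplied by the caller because its formula is truncated above.
    L is assumed to be the smoothness constant and T the number of steps.
    """
    return min(eps / (4 * L * C0 * math.sqrt(1 + C2)), 1 / math.sqrt(T))
```

For long runs the $1/\sqrt{T}$ branch typically becomes the smaller of the two terms, so the first term only matters when $T$ is small relative to the problem constants.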

Figures (5)

  • Figure 1: Pre-training dynamics for Llama 350M (left) and Llama 1.3B (right) on the C4 dataset.
  • Figure 2: Pre-training dynamics over time for Llama 350M (left) and Llama 1.3B (right) on the C4 dataset.
  • Figure 3: Throughput (tokens per second) and peak memory (GB) of Adam and LDAdam with respect to rank for pre-training the Llama 350M model on the C4 dataset, on a single NVIDIA H100 80GB GPU, using a micro-batch size of 1.
  • Figure 4: Training dynamics and validation perplexity for various ranks when pre-training the Llama 350M model. For the training dynamics, we used a single learning rate of $5 \times 10^{-4}$ to allow comparison between runs, and we provide results for the first 10,000 optimization steps. We report the best validation perplexity for learning rates tuned over the set $\{5 \times 10^{-4}, 10^{-3}, 5 \times 10^{-3}\}$.
  • Figure 5: Error buffer norm and gradient norm during the fine-tuning of the RoBERTa-base model on the GLUE benchmark.

Theorems & Definitions (15)

  • Theorem 1: Non-convex convergence rate
  • Theorem 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 5 more