LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh
TL;DR
LDAdam addresses the memory bottleneck of adaptive optimizers by performing optimization steps in a low-dimensional gradient subspace while still exploring the full parameter space. It introduces three components: a projection-aware update rule that transfers momentum across changing subspaces, a subspace estimator based on block PowerSGD, and a generalized error feedback mechanism that compensates for the information lost to projection. All three come with convergence guarantees under standard assumptions. The approach achieves accuracy comparable to Adam with a fraction of the optimizer-state memory, and it outperforms GaLore under similar memory budgets on the RoBERTa, GSM8K, and C4 Llama tasks while maintaining favorable runtime. This work highlights memory-efficient optimization as a viable path for scalable training of large models without sacrificing theoretical guarantees or practical performance, with potential extensions to distributed settings and further memory-reduction techniques.
Abstract
We introduce LDAdam, a memory-efficient optimizer for training large models that performs adaptive optimization steps within lower-dimensional subspaces while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., for estimating the statistics of the projected gradients as the subspace changes. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models. Code is available at https://github.com/IST-DASLab/LDAdam
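
To make the ingredients named in the abstract concrete, the following is a minimal NumPy sketch of one Adam-style step taken in a low-rank gradient subspace. It is an illustration under stated assumptions, not the authors' implementation: the names `LowRankAdamSketch` and `warm_started_basis` are hypothetical, the subspace is refreshed with a single warm-started block power iteration, the second-moment transfer is a simple heuristic chosen here for readability, and the error buffer only tracks the gradient residual rather than the generalized error feedback the paper describes. The official code at the URL above is the reference for the actual update rules.

```python
import numpy as np


def warm_started_basis(a, u_prev):
    """One block power iteration step, warm-started from the previous basis.

    Returns an orthonormal (rows x rank) basis that approximates the
    leading left singular subspace of `a`.
    """
    q, _ = np.linalg.qr(a @ (a.T @ u_prev))
    return q


class LowRankAdamSketch:
    """Adam-style optimizer whose moment estimates live in a rank-r subspace.

    Illustrative only: the state-transfer rule for the second moment and the
    plain gradient error buffer are simplifying assumptions of this sketch.
    """

    def __init__(self, shape, rank, lr=1e-2, betas=(0.9, 0.999), eps=1e-8, seed=0):
        rows, cols = shape
        rng = np.random.default_rng(seed)
        # Random orthonormal starting basis for the subspace.
        self.u, _ = np.linalg.qr(rng.standard_normal((rows, rank)))
        self.m = np.zeros((rank, cols))  # first moment, stored in the subspace
        self.v = np.zeros((rank, cols))  # second moment, stored in the subspace
        self.err = np.zeros(shape)       # residual discarded by past projections
        self.lr = lr
        self.b1, self.b2 = betas
        self.eps = eps
        self.t = 0

    def step(self, w, grad):
        self.t += 1
        # Error feedback: re-inject what earlier projections threw away.
        a = grad + self.err
        u_new = warm_started_basis(a, self.u)

        # Express the old optimizer states in the new basis.
        rot = u_new.T @ self.u           # (rank x rank) change of basis
        m_prev = rot @ self.m
        v_prev = (rot ** 2) @ self.v     # heuristic; keeps v non-negative

        # Project the corrected gradient and record the new residual.
        g_low = u_new.T @ a
        self.err = a - u_new @ g_low

        # Standard Adam moment updates and bias correction, in the subspace.
        self.m = self.b1 * m_prev + (1.0 - self.b1) * g_low
        self.v = self.b2 * v_prev + (1.0 - self.b2) * g_low ** 2
        m_hat = self.m / (1.0 - self.b1 ** self.t)
        v_hat = self.v / (1.0 - self.b2 ** self.t)

        self.u = u_new
        # Map the low-dimensional update back to the full parameter space.
        return w - self.lr * (u_new @ (m_hat / (np.sqrt(v_hat) + self.eps)))


if __name__ == "__main__":
    # Toy problem: minimize ||W - T||_F^2 for a fixed random target T.
    rng = np.random.default_rng(1)
    target = rng.standard_normal((8, 6))
    w = np.zeros((8, 6))
    opt = LowRankAdamSketch(w.shape, rank=2, lr=0.05)
    print("initial loss:", float(np.sum((w - target) ** 2)))
    for _ in range(300):
        w = opt.step(w, 2.0 * (w - target))
    print("final loss:", float(np.sum((w - target) ** 2)))
```

In this sketch the two moment buffers have shape `(rank, cols)` instead of `(rows, cols)`, which is where the savings in optimizer state come from. Note that the error buffer here is full-size; how the residual is actually stored and how the optimizer states enter the error feedback are details this sketch does not attempt to reproduce.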
