A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models
Yiming Chen, Yuan Zhang, Yin Liu, Kun Yuan, Zaiwen Wen
TL;DR
This work tackles the memory bottlenecks in training large language models by jointly reducing activation and optimizer-state memory via Randomized Subspace Optimization (RSO), which solves a sequence of low-dimensional subproblems obtained from random projections. The authors provide a comprehensive convergence analysis under standard smoothness and projection assumptions and demonstrate that RSO achieves memory and communication savings while maintaining performance comparable to GaLore and Adam. Empirical results on LLaMA pre-training and RoBERTa GLUE fine-tuning show substantial reductions in memory footprint and faster iterations, especially at lower subspace ranks, with competitive perplexities and task accuracy. The approach offers practical benefits for scalable distributed training and opens avenues for further reduction of activation memory and exploration of second-order subproblem solvers.
Abstract
The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well understood in the literature. In this work, we introduce a Randomized Subspace Optimization (RSO) framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, which achieves performance comparable to GaLore and Adam.
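The iteration described in the abstract (select a random subspace, optimize the parameters within it, fold the result back into the full parameters) can be illustrated with a minimal sketch on a toy least-squares problem. This is not the authors' implementation. Everything here is an assumption made for illustration: the function names `loss` and `rso_least_squares`, the Gaussian projection `P` scaled by `1/sqrt(rank)`, the update form `W + P @ B`, and the use of plain gradient descent as the subproblem solver (the abstract only says that different strategies can be accommodated). The sketch shows only the optimizer-side effect, namely that the inner variable `B` has shape `rank x n` rather than `m x n`; it does not model activation memory.

```python
import numpy as np


def loss(W, X, Y):
    """Toy objective: 0.5 * ||X W - Y||_F^2."""
    R = X @ W - Y
    return 0.5 * np.sum(R * R)


def rso_least_squares(X, Y, rank=2, outer_iters=50, inner_iters=5, seed=0):
    """Hypothetical randomized-subspace loop on a least-squares problem.

    Each outer iteration samples a random projection P (m x rank) and
    approximately solves the low-dimensional subproblem
        min_B  loss(W + P @ B)
    over B (rank x n), then merges the result into W.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape[1], Y.shape[1]
    W = np.zeros((m, n))
    history = [loss(W, X, Y)]

    for _ in range(outer_iters):
        # Assumed sampling scheme: Gaussian entries scaled by 1/sqrt(rank).
        P = rng.standard_normal((m, rank)) / np.sqrt(rank)

        XP = X @ P          # data seen through the subspace
        R0 = X @ W - Y      # residual at the current W (B = 0)

        # The subproblem gradient is Lipschitz with constant ||X P||_2^2,
        # so a step of 1 / L never increases the subproblem objective.
        step = 1.0 / np.linalg.norm(XP, 2) ** 2

        # Only B (rank x n) is optimized here; any optimizer state
        # would live at this reduced size.
        B = np.zeros((rank, n))
        for _ in range(inner_iters):
            grad_B = XP.T @ (XP @ B + R0)
            B -= step * grad_B

        # Fold the subspace solution back into the full parameters.
        W = W + P @ B
        history.append(loss(W, X, Y))

    return W, history


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((40, 8))
    Y = rng.standard_normal((40, 3))
    W, history = rso_least_squares(X, Y)
    print(f"initial loss {history[0]:.3f}, final loss {history[-1]:.3f}")
```

Because each subproblem starts at `B = 0`, which corresponds to the current `W`, and uses a step size no larger than the inverse smoothness constant, the recorded loss in this toy setting is non-increasing across outer iterations. A practical solver for the subproblem (for example Adam) would replace the inner gradient-descent loop.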
