A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models

Yiming Chen, Yuan Zhang, Yin Liu, Kun Yuan, Zaiwen Wen

TL;DR

This work tackles the memory bottlenecks in training large language models by jointly reducing activation and optimizer-state memory via Randomized Subspace Optimization (RSO), which solves a sequence of low-dimensional subproblems obtained from random projections. The authors provide a comprehensive convergence analysis under standard smoothness and projection assumptions and demonstrate that RSO achieves memory and communication savings while maintaining performance comparable to GaLore and Adam. Empirical results on LLaMA pre-training and RoBERTa GLUE fine-tuning show substantial reductions in memory footprint and faster iterations, especially at lower subspace ranks, with competitive perplexities and task accuracy. The approach offers practical benefits for scalable distributed training and opens avenues for further reduction of activation memory and exploration of second-order subproblem solvers.

Abstract

The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.
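The iteration described in the abstract (draw a random subspace, optimize only the parameters inside it, fold the result back into the weights) can be illustrated with a minimal NumPy sketch on a toy least-squares problem. This is not the authors' implementation. The objective, the scaled-orthonormal projection `P`, the proximal term `B / eta`, the plain gradient-descent inner solver, and all hyperparameters are assumptions chosen for illustration; only the zero initialization of `B` comes from the theorem statement on this page.

```python
import numpy as np

def loss(W, X, Y):
    """Toy least-squares objective f(W) = ||XW - Y||_F^2 / (2n) (an assumption)."""
    R = X @ W - Y
    return 0.5 * np.sum(R * R) / X.shape[0]

def grad(W, X, Y):
    """Gradient of the toy objective with respect to W."""
    return X.T @ (X @ W - Y) / X.shape[0]

def random_projection(m, r, rng):
    """Random m x r matrix with P.T @ P = (m / r) * I (assumed scaling)."""
    Q, _ = np.linalg.qr(rng.standard_normal((m, r)))  # orthonormal columns
    return np.sqrt(m / r) * Q

def rso(W, X, Y, r, outer_steps=50, inner_steps=10, lr=0.05, eta=1.0, seed=0):
    """Hypothetical sketch of a randomized subspace loop."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    for _ in range(outer_steps):
        P = random_projection(m, r, rng)   # fresh random subspace
        B = np.zeros((r, n))               # low-dimensional variable, starts at 0
        for _ in range(inner_steps):
            # Gradient of f(W + P B) + ||B||^2 / (2 * eta) with respect to B.
            B -= lr * (P.T @ grad(W + P @ B, X, Y) + B / eta)
        W = W + P @ B                      # fold the subspace update into W
    return W

# Small synthetic demo: 16 x 8 weight matrix, rank-4 subspaces.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 16))
W_true = rng.standard_normal((16, 8))
Y = X @ W_true
W0 = np.zeros((16, 8))
W_final = rso(W0, X, Y, r=4)
print(loss(W0, X, Y), loss(W_final, X, Y))
```

In this sketch the inner loop only ever updates the `r x n` matrix `B`, so any per-parameter solver state would scale with `r` rather than `m`; the activation-memory savings claimed in the abstract are not modeled here.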

Paper Structure

This paper contains 27 sections, 3 theorems, 35 equations, 2 figures, 10 tables, and 1 algorithm.

Key Result

Theorem 5.5

Under Assumptions ass-smooth and ass-orthogonal, let each subproblem in (eqa-outer-step) be solved starting from the initial point ${\boldsymbol{B}}^0 = {\boldsymbol{0}}$ to an expected $\epsilon$-inexact solution $\tilde{{\boldsymbol{B}}}^k$ with a suitable choice of $\eta^k$. The sequence $\{{\boldsymbol{W}}^k\}$ [...] where $\Delta_0 := f({\boldsymbol{W}}^0) - f^*$ and $\hat{L} := \max_{\ell}\{m_\ell / r_\ell\}L$.
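
As a worked illustration of the constant $\hat{L}$, assume (this reading is not stated in the excerpt) that $m_\ell$ is the row dimension of the layer-$\ell$ weight matrix and $r_\ell$ is the subspace rank used for that layer. With hypothetical values $m_1 = 4096$, $r_1 = 256$ and $m_2 = 11008$, $r_2 = 256$:

$$\hat{L} = \max\left\{\frac{4096}{256}, \frac{11008}{256}\right\} L = \max\{16, 43\}\, L = 43L.$$

Under this reading, the layer with the largest ratio $m_\ell / r_\ell$ determines $\hat{L}$, and lowering a rank $r_\ell$ can only increase it.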

Figures (2)

  • Figure 1: Memory components for training LLaMA-1B model with Adam optimizer.
  • Figure 2: Comparison of peak memory usage (in GB) per device for RSO and GaLore during LLaMA training with varying ranks. All hyperparameters, except rank, are consistent with zhao2024galore. Adam's memory usage is reported for LLaMA-350M and LLaMA-1B but excluded for LLaMA-7B due to an out-of-memory (OOM) error.

Theorems & Definitions (7)

  • Definition 5.1: Expected $\epsilon$-inexact solution
  • Remark 5.4
  • Theorem 5.5
  • Lemma B.1
  • proof
  • Theorem B.2: Theorem \ref{thm-rso}
  • proof