Memory-Efficient LLM Training with Online Subspace Descent

Kaizhao Liang; Bo Liu; Lizhang Chen; Qiang Liu

TL;DR

The paper addresses memory-efficient training of large language models by introducing Online Subspace Descent, an SVD-free, online-PCA-based approach that dynamically updates a projection subspace during optimization. It establishes a convergence guarantee within the Hamiltonian Descent framework, showing that the Hamiltonian serves as a Lyapunov function even under arbitrary, continuous subspace updates. The proposed method yields improved perplexity on LLaMA pretraining across scales (60M–7B) on C4, with substantially lower overhead than SVD-based methods and competitive or superior downstream performance. This dynamic subspace approach narrows the gap to full-rank baselines while enabling scalable, memory-efficient optimization for large models.
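
To make the idea of an SVD-free, online-PCA-style subspace update concrete, here is a minimal NumPy sketch. It assumes one plausible formulation that this page does not spell out: the projection matrix `P` takes a single gradient step on a reconstruction loss `||P P^T G - G||_F^2` plus an orthogonality penalty `lam * ||P^T P - I||_F^2`. The function names, `lr`, and `lam` are hypothetical and chosen for illustration. The paper's actual update rule may differ.

```python
import numpy as np

def pca_objective(P, G, lam):
    """Reconstruction error of G in span(P) plus an orthogonality penalty (assumed form)."""
    R = P @ (P.T @ G) - G                 # residual after projecting G onto span(P)
    E = P.T @ P - np.eye(P.shape[1])      # deviation of P from orthonormal columns
    return np.sum(R ** 2) + lam * np.sum(E ** 2)

def pca_gradient(P, G, lam):
    """Closed-form gradient of pca_objective with respect to P."""
    R = P @ (P.T @ G) - G
    E = P.T @ P - np.eye(P.shape[1])
    return 2.0 * (R @ (G.T @ P) + G @ (R.T @ P)) + 4.0 * lam * (P @ E)

def online_pca_step(P, G, lr=1e-3, lam=0.1):
    """One cheap gradient step on P using the current gradient matrix G; no SVD involved."""
    return P - lr * pca_gradient(P, G, lam)
```

In a training loop, one would call `online_pca_step` once per optimizer step with the current layer gradient, so the subspace drifts continuously instead of being recomputed by a periodic SVD.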

Abstract

Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, the convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including most common ones, such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers without SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with online PCA. Online Subspace Descent is flexible and introduces only minimal overhead to training. We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines.
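
The memory saving from projecting optimizer states into a subspace can be illustrated with a small NumPy sketch. It assumes a plain momentum optimizer as a stand-in for Adam or LION, and a fixed projection matrix `P` with orthonormal columns. The function name `subspace_momentum_step` and the shapes `m`, `n`, `r` are hypothetical and chosen for illustration.

```python
import numpy as np

def subspace_momentum_step(W, M, P, G, lr=1e-2, beta=0.9):
    """One momentum step whose optimizer state M lives in the r-dimensional subspace."""
    G_low = P.T @ G                        # project the (m, n) gradient down to (r, n)
    M = beta * M + (1.0 - beta) * G_low    # momentum is stored at the reduced size only
    W = W - lr * (P @ M)                   # map the update back to the full (m, n) space
    return W, M

m, n, r = 64, 32, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))                    # full-rank weight matrix
P, _ = np.linalg.qr(rng.standard_normal((m, r)))   # (m, r) matrix with orthonormal columns
M = np.zeros((r, n))                               # optimizer state: r*n entries, not m*n
G = rng.standard_normal((m, n))                    # stand-in for a layer gradient
W_new, M_new = subspace_momentum_step(W, M, P, G)
```

With these illustrative shapes the momentum buffer holds `r * n` entries instead of `m * n`, and every weight update lies in the column space of `P`.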

Paper Structure (25 sections, 1 theorem, 36 equations, 3 figures, 6 tables, 1 algorithm)

Introduction
Optimization Background
Hamiltonian+Descent
Memory-Efficient Optimizers via Online Subspace Descent
Static Subspace Descent
Online Subspace Descent
Difficulty in Theoretical Understanding
Hamiltonian Descent Meets Subspace Descent: A Lyapunov Analysis
Online Subspace Descent Preserves the Hamiltonian+Descent Structure
Convergence to Local Optima
Online Subspace Descent with General Linear Projection Operators
Experiment
Why do we Need Online Subspace Descent?
What Rank Should we Pick for Online Subspace Descent?
What are the Best Hyperparameters?
...and 10 more sections

Key Result

Theorem 4.5

Suppose Assumption def:ad holds. Let $({\boldsymbol{W}}_t, {\boldsymbol{S}}_t, {\boldsymbol{P}}_t)_{t}$ be a bounded solution of equ:hadyp. Then all accumulation points of $\{{\boldsymbol{W}}_t \}$ as $t\to +\infty$ are stationary points of $L({\boldsymbol{W}})$.

Figures (3)

Figure 1: Pretraining LLaMA 1B with a sequence length of 256 for 10K steps. Perplexity is reported as the training average over the last 10 steps. AdamW8bit serves as the base optimizer for both.
Figure 2: The execution time of torch.svd and that of a single-step backward() call for online PCA in PyTorch, on matrices of typical shapes in the linear layers of LLaMA 60M to 7B. Thanks to the high speed of single-step online PCA, ${\boldsymbol{P}}_t$ updates can be executed in parallel with weight updates, adding no overhead to the training process. In contrast, SVD incurs significant overhead as the model and weight tensor sizes increase.
Figure 3: Loss curves over 10K steps on LLaMA 60M. From left to right: sweep of rank, sweep of $\alpha$, and sweep of $\lambda$.

Theorems & Definitions (10)

Example 2.1
Example 2.2
Example 2.3
Example 2.4
Example 3.1
Example 4.1
Example 4.2
Example 4.3
Theorem 4.5
proof
