CompAct: Compressed Activations for Memory-Efficient LLM Training

Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster

TL;DR

CompAct compresses the inputs of linear layers with random projections and stores only the compressed activations for the backward pass, substantially reducing peak memory with minimal performance loss. Gradients and optimizer updates are computed in the reduced subspace and decompressed only for the parameter update, so the savings come from the compute graph, the dominant component of training memory. Reported peak-memory reductions are about 25-30% during pretraining and up to 50% during finetuning, and the savings are expected to grow with model size. The method relies on Gaussian random projections, per-layer seeding (or shared seeds), and a projection update period that balances accuracy and speed, and it demonstrates strong memory–throughput tradeoffs on LLaMA pretraining and RoBERTa finetuning. The work also outlines practical extensions, including sparse projections and integration with activation checkpointing and other memory-saving strategies, offering a viable path to training larger models within fixed hardware budgets.

Abstract

We introduce CompAct, a technique that reduces peak memory utilization on GPU by 25-30% for pretraining and 50% for fine-tuning of LLMs. Peak device memory is a major limiting factor in training LLMs, with various recent works aiming to reduce model memory. However, most works don't target the largest component of allocated memory during training: the model's compute graph, which is stored for the backward pass. By storing low-rank, compressed activations to be used in the backward pass, we greatly reduce the required memory, unlike previous methods, which only reduce optimizer overheads or the number of trained parameters. Our compression uses random projection matrices, thus avoiding additional memory overheads. Comparisons with previous techniques for either pretraining or fine-tuning show that CompAct substantially improves existing compute-performance tradeoffs. We expect CompAct's savings to scale even higher for larger models.
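
The pretraining figure can be sanity-checked against the numbers in the Figure 1 caption (almost 33% of memory held by compressible linear activations for LLaMA 65B, rank $r=n/8$). The following is a first-order estimate only, not the paper's accounting: it assumes the compressible share shrinks exactly by the factor $r/n$ and that nothing else changes. The function name `total_memory_saving` is hypothetical.

```python
def total_memory_saving(compressible_fraction, rank_ratio):
    """Estimate the fraction of peak memory saved.

    compressible_fraction: share of peak memory held by activations
        that get compressed (e.g. 0.33 from the Figure 1 caption).
    rank_ratio: r / n, the compressed size relative to the original.
    Assumption: only the compressible share changes, and it shrinks
    exactly by rank_ratio.
    """
    if not 0.0 <= compressible_fraction <= 1.0:
        raise ValueError("compressible_fraction must be in [0, 1]")
    if not 0.0 < rank_ratio <= 1.0:
        raise ValueError("rank_ratio must be in (0, 1]")
    return compressible_fraction * (1.0 - rank_ratio)


# Figure 1 caption values: ~33% compressible, r = n/8.
saving = total_memory_saving(0.33, 1 / 8)
```

With these inputs the estimate is $0.33 \times (1 - 1/8) \approx 0.289$, in line with the "almost 30%" stated in the caption.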

Paper Structure

This paper contains 33 sections, 2 theorems, 10 equations, 8 figures, 6 tables, 3 algorithms.

Key Result

Theorem 1

(Convergence of GaLore with fixed projections). Suppose the gradient follows the parametric form: with constant $A_i$, PSD matrices $B_i$ and $C_i$ after $t>t_0$, and $A_i$, $B_i$ and $C_i$ have $L_A$, $L_B$ and $L_C$ continuity with respect to $W$ and $\Vert W_t\Vert\leq D$. Let $R_t := P_t^\top G_t Q_t, \hat{B}_{it}:= P_t^\top B_i(W_t)P_t, \hat{C}_{it} := Q_t^\top C_i(W_t)Q_t$ and $\kappa _t:=

Figures (8)

  • Figure 1: Breakdown of memory components for various LLaMA model sizes, with batch size $256$. Blue: linear operations compressed by CompAct; Red: non-linear operations which CompAct doesn't compress; Green: model parameters and non-linear operation's optimizer states. Most of the memory is used by the computational graph. CompAct's compression gets more significant as model size increases, reaching almost 33% for LLaMA 65B. With $r=n/8$, this translates to almost 30% total memory saved.
  • Figure 2: Overview of CompAct. For a given linear layer $x_{i+1} = x_i W_{i+1}$, we project its input $x_i$ using a random projection matrix $P$, and save the result $z_{i}$ for the backward pass. During the backward pass, we first compute the compressed gradients $\hat{G}_i$ and update the optimizer's parameter update function $\rho_t(\hat{G}_i)$. For Adam, $\rho_t$ represents gradient normalization using the first and second gradient moments. Finally, we decompress the gradient back to the full parameter size $\tilde{G}_i$ and perform an update step.
  • Figure 3: (a) Throughput and (b) peak device memory during pretraining of LLaMA-350M. Using smaller ranks with CompAct achieves better compression than GaLore while increasing the throughput. When applying activation checkpointing (CKPT), CompAct remains competitive, achieving better throughput and a smaller memory footprint.
  • Figure 4: Final model perplexity of CompAct with $r=n/4$ for different choices of projection matrices. Both Gaussian seed choices and the JL projection achieve comparable results.
  • Figure 5: Ablation on LLaMA 130M: effect of varying the projection update period $T$ on performance across different ranks in CompAct.
  • ...and 3 more figures
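
The steps described in the Figure 2 caption can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`gaussian_projection`, `forward`, `backward`) are hypothetical, the $1/\sqrt{r}$ scaling of $P$ is an assumption chosen so that $P P^\top$ equals the identity in expectation, and a plain scaled-gradient step stands in for the optimizer rule $\rho_t$ (Adam would normalize with its moment estimates instead). The projection is regenerated from an integer seed rather than stored, which is one way to read the abstract's remark that random projections avoid additional memory overhead.

```python
import random


def gaussian_projection(seed, n, r):
    """Regenerate an n x r Gaussian projection from an integer seed.

    Assumption: entries are scaled by 1/sqrt(r) so E[P P^T] = I.
    """
    rng = random.Random(seed)
    scale = r ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(r)] for _ in range(n)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def forward(x, w, seed, r):
    """Compute y = x W and the compressed activation z = x P.

    x is (b, n), w is (n, m). Only z, of shape (b, r), needs to be
    kept for the backward pass instead of x, of shape (b, n).
    """
    p = gaussian_projection(seed, len(w), r)
    z = matmul(x, p)
    return matmul(x, w), z


def backward(z, grad_out, seed, n, lr=0.1):
    """Return the decompressed weight update of shape (n, m).

    g_hat = z^T dY = P^T (x^T dY) is the compressed gradient (r, m).
    The scaled step below is a stand-in for rho_t, not Adam.
    """
    g_hat = matmul(transpose(z), grad_out)
    update_hat = [[lr * g for g in row] for row in g_hat]
    p = gaussian_projection(seed, n, len(g_hat))  # same P, rebuilt from seed
    return matmul(p, update_hat)


# Toy usage: batch b=2, input width n=4, output width m=3, rank r=2.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]
w = [[0.1, 0.1, 0.1] for _ in range(4)]
y, z = forward(x, w, seed=0, r=2)
grad_out = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
delta_w = backward(z, grad_out, seed=0, n=4)
```

The gradient with respect to the layer input, $\partial L / \partial x_i = (\partial L / \partial x_{i+1}) W_{i+1}^\top$, does not depend on the stored activation, so in this sketch only the weight gradient is affected by the compression.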

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2