PRAC: Principal-Random Subspace for LLM Activation Compression and Memory-Efficient Training

Yanyi Li, Yimu Zhang, Cong Fang

TL;DR

This work proposes Principal-Random Subspace for LLM Activation Compression (PRAC), which decomposes activations into two components: a principal subspace, captured via SVD, that retains the dominant information, and a random subspace, sampled from the orthogonal complement, that approximates the tail.

Abstract

Activations have become the primary memory bottleneck in large-batch LLM training. However, existing compression methods fail to exploit the spectral structure of activations, resulting in slow convergence or limited compression. To address this, we relate the algorithm's fast convergence to the requirements on subspace projection, and show that an effective compression scheme should yield an unbiased, low-variance estimate of the original activation. We propose Principal-Random Subspace for LLM Activation Compression (PRAC), which decomposes activations into two components: a principal subspace, captured via SVD, that retains the dominant information, and a random subspace, sampled from the orthogonal complement, that approximates the tail. By introducing a precise scaling factor, we prove that PRAC yields an unbiased gradient estimator with minimum variance under certain conditions. Extensive experiments on pre-training and fine-tuning tasks demonstrate that PRAC achieves up to 36% total memory reduction with negligible performance degradation and minimal computational cost.
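
The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `prac_compress` and `prac_reconstruct`, the choice of right singular vectors as the principal basis, the Gaussian-plus-QR sampling of the random subspace, and the scaling factor `(d - k) / r` are all assumptions made for illustration. The scaling is chosen so that, for a uniformly random `r`-dimensional subspace of the `(d - k)`-dimensional orthogonal complement, the reconstruction is unbiased in expectation; the paper's exact factor and sampling scheme may differ.

```python
import numpy as np

def prac_compress(X, k, r, rng):
    """Project X (n x d) onto k principal and r random directions.

    Hypothetical helper: names, sampling scheme, and scaling are assumptions.
    """
    n, d = X.shape
    assert 0 < k <= min(n, d) and 0 < r and k + r <= d
    # Principal subspace: top-k right singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T                       # (d, k), orthonormal columns
    # Random subspace: Gaussian directions with the principal part
    # projected out, then orthonormalized via QR.
    G = rng.standard_normal((d, r))
    G -= V @ (V.T @ G)
    Q, _ = np.linalg.qr(G)             # (d, r), orthogonal to V
    # Assumed scaling: E[Q Q^T] = (r / (d - k)) * (I - V V^T),
    # so multiplying the random part by (d - k) / r removes the bias.
    scale = (d - k) / r
    return X @ V, X @ Q, V, Q, scale

def prac_reconstruct(P, R, V, Q, scale):
    """Rebuild an estimate of X from the two low-dimensional projections."""
    return P @ V.T + scale * (R @ Q.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 16))
P, R, V, Q, scale = prac_compress(X, k=4, r=4, rng=rng)
X_hat = prac_reconstruct(P, R, V, Q, scale)
print(P.shape, R.shape, X_hat.shape)
```

In this sketch only the `n x k` and `n x r` projections (plus the two bases) need to be kept instead of the full `n x d` activation. A single reconstruction is noisy in the tail directions, but averaging many independent reconstructions approaches `X`, which is the unbiasedness property the abstract refers to.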

Paper Structure

This paper contains 26 sections, 15 theorems, 40 equations, 7 figures, 9 tables, 1 algorithm.

Key Result

Proposition 1

Consider the two-dimensional strongly convex problem $f(w^{(1)},w^{(2)}) = \mathbb{E}_{\xi'}(w^{(1)} - 3\xi')^2 + (w^{(2)})^2$, where $\xi'$ follows a Bernoulli distribution. The parameters are initialized at $w^{(1)} = 0$, $w^{(2)} = 1$. At each step, only the first coordinate is updated by a stochastic gra...

Figures (7)

  • Figure 1: The proposed PRAC projects activations onto both the principal and random subspaces and yields the minimum variance unbiased estimator, thus achieving up to 36% total memory reduction with negligible performance degradation. (Left) The flowchart of PRAC; (Middle) A conceptual comparison of subspace strategies; (Right) Performance comparison on LLaMA-1B.
  • Figure 2: Singular value spectrum (Left) and cumulative energy ratio (Right).
  • Figure 3: Loss curves for pre-training the LLaMA-130M and LLaMA-350M models.
  • Figure 4: Scaling behavior of PRAC across model sizes (35M to 1B parameters).
  • Figure 5: Loss curves for PRAC, PAC, and RAC in LLaMA-series pre-training. For RAC, the result with $r=0.6$ is reported because the run with $r=0.3$ diverged.
  • ...and 2 more figures

Theorems & Definitions (16)

  • Proposition 1: Non-convergence under Constant Bias
  • Proposition 2: Convergence Rate with Bounded Gradient Variance [sgd_pre1]
  • Lemma 3
  • Lemma 5: Lower Bound
  • Theorem 6: Unbiased Reconstruction of PRAC
  • Theorem 7: Variance Bound of PRAC
  • Definition 8: Orthogonal Group
  • Lemma 9
  • Lemma 10
  • Lemma 11
  • ...and 6 more