PRAC: Principal-Random Subspace for LLM Activation Compression and Memory-Efficient Training

Yanyi Li; Yimu Zhang; Cong Fang

TL;DR

This work proposes Principal-Random Subspace for LLM Activation Compression (PRAC), which decomposes activations into two components: a principal subspace, captured via SVD, that retains the dominant information, and a random subspace, sampled from the orthogonal complement, that approximates the tail.

Abstract

Activations have become the primary memory bottleneck in large-batch LLM training. However, existing compression methods fail to exploit the spectral structure of activations, resulting in slow convergence or limited compression. To address this, we relate the fast convergence of the training algorithm to the requirements on the subspace projection, and show that an effective compression scheme should yield an unbiased, low-variance estimate of the original activation. We propose Principal-Random Subspace for LLM Activation Compression (PRAC), which decomposes activations into two components: a principal subspace, captured via SVD, that retains the dominant information, and a random subspace, sampled from the orthogonal complement, that approximates the tail. By introducing a precise scaling factor, we prove that PRAC yields an unbiased gradient estimator with minimum variance under certain conditions. Extensive experiments on pre-training and fine-tuning tasks demonstrate that PRAC achieves up to 36% total memory reduction with negligible performance degradation and minimal computational cost.
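The decomposition described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the toy shapes, and in particular the scaling factor `(d - k) / r` are assumptions. That factor is the standard choice that makes a uniformly random `r`-dimensional projection inside a `(d - k)`-dimensional space unbiased; the paper's exact scaling may differ.

```python
import numpy as np

def principal_and_complement(X, k):
    """Return bases of the top-k right singular subspace of X and of its orthogonal complement."""
    _, _, Vt = np.linalg.svd(X, full_matrices=True)  # Vt has shape (d, d)
    return Vt[:k].T, Vt[k:].T                        # shapes (d, k) and (d, d - k)

def compress(X, V_k, V_perp, r, rng):
    """Project X onto the principal subspace and onto a random r-dim subspace of the complement."""
    m = V_perp.shape[1]                              # dimension of the complement, d - k
    G = rng.standard_normal((m, r))
    Q_small, _ = np.linalg.qr(G)                     # (m, r), orthonormal columns
    Q = V_perp @ Q_small                             # (d, r), orthogonal to V_k
    scale = m / r                                    # assumed scaling for unbiasedness
    return X @ V_k, X @ Q, Q, scale                  # only (n, k) and (n, r) coefficients are kept

def reconstruct(P, R, V_k, Q, scale):
    """Rebuild an estimate of X from the stored coefficients."""
    return P @ V_k.T + scale * (R @ Q.T)

def estimate(X, V_k, V_perp, r, rng):
    P, R, Q, scale = compress(X, V_k, V_perp, r, rng)
    return reconstruct(P, R, V_k, Q, scale)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))                     # toy "activation" matrix, n = 8, d = 16
V_k, V_perp = principal_and_complement(X, k=4)

# Averaging many independent estimates should approach X if the estimator is unbiased.
X_mean = np.mean([estimate(X, V_k, V_perp, 4, rng) for _ in range(2000)], axis=0)
rel_err = np.linalg.norm(X_mean - X) / np.linalg.norm(X)
print("relative error of the averaged estimate:", rel_err)
```

In this sketch the stored coefficients have `k + r` columns instead of `d`, which is where the memory saving would come from; the principal component is reproduced exactly and only the tail is randomized.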

Paper Structure (26 sections, 15 theorems, 40 equations, 7 figures, 9 tables, 1 algorithm)

Introduction
Related Works
Starting Point of Activation Compression
Problem Formulation
Criteria for Activation Compression
Proposed Method: PRAC
Spectral Analysis of Activations
Key Components of PRAC
Optimal Design via Hybrid Projection
PRAC for Memory-Efficient Training
PRAC for LLM Training
Memory and Computational Efficiency
Experiments
Memory-Efficient Pre-training
Memory-Efficient Fine-tuning
...and 11 more sections

Key Result

Proposition 1

Consider the two-dimensional strongly convex problem $f(w^{(1)},w^{(2)}) =\mathbb{E}_{\xi'}(w^{(1)}- 3\xi')^2+(w^{(2)})^2$, with $\xi'$ following a Bernoulli distribution. The parameters are initialized at $w^{(1)} = 0, w^{(2)} = 1$. At each step, only the first coordinate is updated by a stochastic gra
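The statement above is cut off in the source, so the following is only a hedged numerical sketch of the visible part: the second coordinate is never updated. The Bernoulli parameter `p = 0.5`, the step size `1 / (2(t + 1))`, and the function names are assumptions made for illustration, as is the reading of the update as a plain stochastic gradient step on the first coordinate.

```python
import numpy as np

def run_frozen_coordinate_sgd(steps=5000, p=0.5, seed=0):
    """SGD on f(w1, w2) = E[(w1 - 3*xi)^2] + w2^2 where only w1 is ever updated."""
    rng = np.random.default_rng(seed)
    w = np.array([0.0, 1.0])                  # initialization from the statement
    for t in range(steps):
        xi = rng.binomial(1, p)               # Bernoulli sample (p is an assumption)
        g1 = 2.0 * (w[0] - 3.0 * xi)          # stochastic gradient w.r.t. w1
        w[0] -= g1 / (2.0 * (t + 1))          # assumed decaying step size
        # w[1] is deliberately left untouched: its gradient is dropped at every step
    return w

def full_gradient(w, p=0.5):
    """Exact gradient of f at w, using E[xi] = p."""
    return np.array([2.0 * (w[0] - 3.0 * p), 2.0 * w[1]])

w = run_frozen_coordinate_sgd()
g = full_gradient(w)
print("final iterate:", w)
print("full gradient norm:", np.linalg.norm(g))
```

Under these assumptions the first coordinate settles near the minimizer $3p$, while the second stays at its initial value, so the full gradient norm cannot fall below 2 no matter how many steps are taken.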

Figures (7)

Figure 1: The proposed PRAC projects activations onto both the principal and random subspaces and yields the minimum variance unbiased estimator, thus achieving up to 36% total memory reduction with negligible performance degradation. (Left) The flowchart of PRAC; (Middle) A conceptual comparison of subspace strategies; (Right) Performance comparison on LLaMA-1B.
Figure 2: Singular value spectrum (Left) and cumulative energy ratio (Right).
Figure 3: Loss curves for pre-training the LLaMA-130M and LLaMA-350M models.
Figure 4: Scaling behavior of PRAC across model sizes (35M to 1B parameters).
Figure 5: Loss curves of PRAC, PAC, and RAC in LLaMA-series pre-training. RAC is reported with $r=0.6$ because it diverges with $r=0.3$.
...and 2 more figures

Theorems & Definitions (16)

Proposition 1: Non-convergence under Constant Bias
Proposition 2: Convergence Rate with Bounded Gradient Variance, sgd_pre1
Lemma 3
Lemma 5: Lower Bound
Theorem 6: Unbiased Reconstruction of PRAC
Theorem 7: Variance Bound of PRAC
Definition 8: Orthogonal Group
Lemma 9
Lemma 10
Lemma 11
...and 6 more
