ESPACE: Dimensionality Reduction of Activations for Model Compression

Charbel Sakr; Brucek Khailany

TL;DR

ESPACE introduces activation-centric model compression by projecting activation tensors onto a static, pre-calibrated orthonormal basis, reducing activation dimensionality without altering trainable weights. The authors derive theoretical guidance for constructing the projection matrix via eigen-decomposition of activation auto-correlation and related bounds, yielding six candidate projection schemes for each layer and enabling layer-wise selection. Empirically, ESPACE achieves up to ~50% compression on GPT3, Llama2, and Nemotron4 with small perplexity changes (e.g., +0.18 on GPT3-22B) and substantial GEMM latency reductions (up to ~46%), with occasional perplexity improvements at moderate compression. Compared to weight-centric tensor decompositions, ESPACE offers a practical, activation-based route to compression that preserves expressivity during retraining and leverages matrix multiplication associativity for inference-time compression, establishing a new direction in LLM tensor decomposition research.

Abstract

We propose ESPACE, an LLM compression technique based on dimensionality reduction of activations. Unlike prior works on weight-centric tensor decomposition, ESPACE projects activations onto a pre-calibrated set of principal components. The activation-centrality of the approach enables retraining LLMs with no loss of expressivity, while at inference, weight decomposition is obtained as a byproduct of matrix multiplication associativity. Theoretical results on the construction of projection matrices with optimal computational accuracy are provided. Experimentally, we find ESPACE enables 50% compression of GPT3, Llama2, and Nemotron4 models with small accuracy degradation, as low as a 0.18 perplexity increase on GPT3-22B. At lower compression rates of 20% to 40%, ESPACE drives GPT3 models to outperform their baselines, by up to a 0.38 decrease in perplexity for GPT3-8B. ESPACE also reduces GEMM execution time and prefill inference latency on existing hardware. Comparison with related works on compressing Llama2-7B via matrix factorization shows that ESPACE is a first step in advancing the state-of-the-art in tensor decomposition compression of LLMs.
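The associativity argument in the abstract can be illustrated with a minimal NumPy sketch. It assumes the GEMM has the form $\mathbf{W}^T\mathbf{X}$ with an orthonormal $\mathbf{P}$, as suggested by the $\left(\mathbf{P}^T\mathbf{W}\right)$ term in the Figure 2 caption. The dimensions, the synthetic data, and the QR-based stand-in for a calibrated projection are illustrative assumptions, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: input dim K, output dim N, tokens M, reduced dim L.
K, N, M, L = 64, 48, 256, 16

# Synthetic activations that are approximately rank-L, plus small noise.
basis = rng.standard_normal((K, L))
X = basis @ rng.standard_normal((L, M)) + 0.01 * rng.standard_normal((K, M))
W = rng.standard_normal((K, N))

# Stand-in for a calibrated projection: an orthonormal basis (K x L)
# spanning the dominant activation subspace.
P, _ = np.linalg.qr(basis)

Y_ref = W.T @ X                    # baseline GEMM
Y_train = W.T @ (P @ (P.T @ X))    # training view: full W, projected activations
W_small = P.T @ W                  # inference view: pre-computed L x N matrix
Y_infer = W_small.T @ (P.T @ X)    # same result by associativity

rel_err = np.linalg.norm(Y_ref - Y_infer) / np.linalg.norm(Y_ref)
print(W.shape, "->", W_small.shape, "relative error:", rel_err)
```

In this sketch the training and inference views agree up to floating-point error, and the only approximation comes from the part of the activations lying outside the span of `P`.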

Paper Structure (30 sections, 4 theorems, 29 equations, 8 figures, 2 tables)

Introduction
Related work and motivation for activation-centric tensor decomposition
Contributions
Dimensionality Reduction & Projections
Matrix Multiplication and Weight Decomposition
Activation Decomposition via Static Projection
Eigen Static Principal Activation Component Estimation
Activation auto-correlation estimation
Activation decomposition with minimum mean squared error
Activation decomposition with optimized forward propagated accuracy metrics
Model Compression Studies
Experimental setup
Validation perplexity studies
Compression of GPT3 models
Compression of Llama2 models and comparison to related works
...and 15 more sections

Key Result

Theorem 1

For an activation tensor $\mathbf{X}$ whose auto-correlation matrix has an eigenvalue decomposition given by $\mathbf{C}_{\mathbf{X}} = \mathbf{V}\mathbf{D}\mathbf{V}^T$, the projection matrix $\mathbf{P}$ minimizing the mean squared error in eqn:mse is given by $\mathbf{P} = [\mathbf{v}_1 \mid \ldots$
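The theorem statement is cut off in this excerpt. The sketch below therefore assumes it continues as the standard principal-component result: the columns of $\mathbf{P}$ are the eigenvectors of $\mathbf{C}_{\mathbf{X}}$ with the largest eigenvalues, and the error is $\mathbb{E}\|\mathbf{x} - \mathbf{P}\mathbf{P}^T\mathbf{x}\|^2$. The data, sizes, and variable names are hypothetical. The code estimates the auto-correlation from synthetic calibration activations, builds $\mathbf{P}$, and checks that the resulting error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: activation dim K, calibration samples M, kept components L.
K, M, L = 32, 4096, 8

# Synthetic calibration activations with a decaying spectrum in a rotated basis.
scales = np.linspace(3.0, 0.1, K)
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
X = Q @ (scales[:, None] * rng.standard_normal((K, M)))

# Empirical auto-correlation matrix (K x K) and its eigen-decomposition.
C = (X @ X.T) / M
eigvals, V = np.linalg.eigh(C)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to descending
eigvals, V = eigvals[order], V[:, order]

P = V[:, :L]                            # top-L eigenvectors as projection columns

def projection_mse(P, X):
    """Mean over samples of the squared reconstruction error ||x - P P^T x||^2."""
    return np.mean(np.sum((X - P @ (P.T @ X)) ** 2, axis=0))

mse = projection_mse(P, X)
tail = eigvals[L:].sum()                # sum of the discarded eigenvalues

# Any other orthonormal K x L basis should do no better on the same data.
P_rand, _ = np.linalg.qr(rng.standard_normal((K, L)))
mse_rand = projection_mse(P_rand, X)
print(mse, tail, mse_rand)
```

Because the auto-correlation here is estimated from the same samples used to measure the error, the match between `mse` and `tail` is exact up to floating-point error. On held-out activations it would only hold approximately.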

Figures (8)

Figure 1: Perplexity versus model size for GPT3 and Llama2 models and comparison to compressed models using ESPACE.
Figure 2: Decompositions in GEMMs: (a) baseline multiplication of weight matrix and activation tensor, (b) truncated SVD on the weight matrix, and (c) proposed approach of inserting a static matrix to project activations. With ESPACE, all weights are available for training, while inference compression is achieved via pre-computation of $\left( \mathbf{P}^T \mathbf{W}\right)$.
Figure 3: Validation perplexity for GPT3-22B when ESPACE is progressively applied to its GEMM layers. The order of layer selection is based on a layer-wise sensitivity analysis.
Figure 4: Comparison to related works compressing Llama2-7B using matrix factorization techniques.
Figure 5: Sensitivity studies on the choice of projection construction for (a) GPT3-1.3B, (b) GPT3-8B, (c) GPT3-22B. For each layer, we apply ESPACE out-of-the-box using each of the six candidates for the projection matrix $\mathbf{P}$ constructed in the theory section. The black line corresponds to the baseline perplexity.
...and 3 more figures

Theorems & Definitions (8)

Theorem 1
proof
Corollary 2
proof
Proposition 3
proof
Theorem 4
proof
