ESPACE: Dimensionality Reduction of Activations for Model Compression

Charbel Sakr, Brucek Khailany

TL;DR

ESPACE introduces activation-centric model compression by projecting activation tensors onto a static, pre-calibrated orthonormal basis, reducing activation dimensionality without altering trainable weights. The authors derive theoretical guidance for constructing the projection matrix via eigen-decomposition of activation auto-correlation and related bounds, yielding six candidate projection schemes for each layer and enabling layer-wise selection. Empirically, ESPACE achieves up to ~50% compression on GPT3, Llama2, and Nemotron4 with small perplexity changes (e.g., +0.18 on GPT3-22B) and substantial GEMM latency reductions (up to ~46%), with occasional perplexity improvements at moderate compression. Compared to weight-centric tensor decompositions, ESPACE offers a practical, activation-based route to compression that preserves expressivity during retraining and leverages matrix multiplication associativity for inference-time compression, establishing a new direction in LLM tensor decomposition research.

Abstract

We propose ESPACE, an LLM compression technique based on dimensionality reduction of activations. Unlike prior works on weight-centric tensor decomposition, ESPACE projects activations onto a pre-calibrated set of principal components. The activation-centrality of the approach enables retraining LLMs with no loss of expressivity; while at inference, weight decomposition is obtained as a byproduct of matrix multiplication associativity. Theoretical results on the construction of projection matrices with optimal computational accuracy are provided. Experimentally, we find ESPACE enables 50% compression of GPT3, Llama2, and Nemotron4 models with small accuracy degradation, as low as a 0.18 perplexity increase on GPT3-22B. At lower compression rates of 20% to 40%, ESPACE drives GPT3 models to outperforming their baseline, by up to a 0.38 decrease in perplexity for GPT3-8B. ESPACE also reduces GEMM execution time and prefill inference latency on existing hardware. Comparison with related works on compressing Llama2-7B via matrix factorization shows that ESPACE is a first step in advancing the state-of-the-art in tensor decomposition compression of LLMs.
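The associativity argument in the abstract can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the shapes, the variable names, the convention that the layer computes $\mathbf{W}^T \mathbf{X}$, and the use of a random orthonormal matrix as a stand-in for a calibrated projection. It only shows that projecting activations during training and multiplying by a pre-computed $\mathbf{P}^T \mathbf{W}$ at inference give the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, k = 64, 48, 10, 16  # hypothetical: input dim, output dim, tokens, reduced dim

W = rng.standard_normal((d, m))  # full weight matrix, stays trainable
X = rng.standard_normal((d, n))  # activations, one column per token

# Stand-in for a calibrated projection: any matrix with orthonormal columns.
P, _ = np.linalg.qr(rng.standard_normal((d, k)))  # shape (d, k), P.T @ P = I

# Training-time view: activations are projected, W keeps its full shape.
y_train = W.T @ (P @ (P.T @ X))

# Inference-time view: fold P into the weights once, then run two smaller GEMMs.
W_small = P.T @ W                # shape (k, m), pre-computed offline
y_infer = W_small.T @ (P.T @ X)

# Multiply-accumulate counts per token for the dense and factored forms.
dense_macs = d * m
factored_macs = k * d + k * m

print(np.allclose(y_train, y_infer), dense_macs, factored_macs)
```

The factored form is cheaper only when `k * (d + m)` is smaller than `d * m`, so the reduced dimension has to be chosen with both layer dimensions in mind.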

Paper Structure

This paper contains 30 sections, 4 theorems, 29 equations, 8 figures, 2 tables.

Key Result

Theorem 1

For an activation tensor $\mathbf{X}$ whose auto-correlation matrix has an eigenvalue decomposition given by $\mathbf{C}_{\mathbf{X}} = \mathbf{V}\mathbf{D}\mathbf{V}^T$, the projection matrix $\mathbf{P}$ minimizing the mean squared error in eqn:mse is given by $\mathbf{P} = [\mathbf{v}_1 | \ldots$
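The theorem statement is cut off here, so the following is only a sketch under a stated assumption: that $\mathbf{P}$ collects the eigenvectors of $\mathbf{C}_{\mathbf{X}}$ belonging to its largest eigenvalues, as in standard PCA. The function name `pca_projection`, the synthetic data, and the $1/n$ normalization of the auto-correlation estimate are all hypothetical choices made for this example.

```python
import numpy as np

def pca_projection(X, k):
    """Build a (d, k) projection from calibration activations X of shape (d, n).

    Assumes the standard PCA construction: the k leading eigenvectors of the
    empirical auto-correlation matrix C = X X^T / n.
    """
    C = X @ X.T / X.shape[1]
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    return eigvecs[:, order], eigvals

def projection_mse(X, P):
    """Mean over tokens of the squared error ||x - P P^T x||^2."""
    residual = X - P @ (P.T @ X)
    return np.mean(np.sum(residual ** 2, axis=0))

rng = np.random.default_rng(0)
d, n, k = 32, 2000, 8  # hypothetical sizes

# Synthetic activations: a rank-k signal plus a small amount of noise.
basis = rng.standard_normal((d, k))
X = basis @ rng.standard_normal((k, n)) + 0.01 * rng.standard_normal((d, n))

P, eigvals = pca_projection(X, k)
mse_pca = projection_mse(X, P)

# For this construction the error equals the sum of the discarded eigenvalues.
discarded = np.sort(eigvals)[: d - k].sum()

# A random orthonormal basis of the same size, for comparison.
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
mse_random = projection_mse(X, Q)

print(mse_pca, discarded, mse_random)
```

On the calibration data itself, the eigenvector basis gives a reconstruction error no larger than any other orthonormal basis of the same size, which is why the random basis serves as a simple sanity check.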

Figures (8)

  • Figure 1: Perplexity versus model size for GPT3 and Llama2 models and comparison to compressed models using ESPACE.
  • Figure 2: Decompositions in GEMMs: (a) baseline multiplication of weight matrix and activation tensor, (b) truncated SVD on the weight matrix, and (c) proposed approach of inserting a static matrix to project activations. With ESPACE, all weights are available for training, while inference compression is achieved via pre-computation of $\left( \mathbf{P}^T \mathbf{W}\right)$.
  • Figure 3: Validation perplexity for GPT3-22B when ESPACE is progressively applied to its GEMM layers. The order of layer selection is based on a layer-wise sensitivity analysis.
  • Figure 4: Comparison to related works compressing Llama2-7B using matrix factorization techniques.
  • Figure 5: Sensitivity studies on the choice of projection construction for (a) GPT3-1.3B, (b) GPT3-8B, (c) GPT3-22B. For each layer, we apply ESPACE out-of-the-box using each of the six candidates for the projection matrix $\mathbf{P}$ constructed in the theory section. The black line corresponds to the baseline perplexity.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Theorem 1
  • proof
  • Corollary 2
  • proof
  • Proposition 3
  • proof
  • Theorem 4
  • proof