ESPACE: Dimensionality Reduction of Activations for Model Compression
Charbel Sakr, Brucek Khailany
TL;DR
ESPACE introduces activation-centric model compression by projecting activation tensors onto a static, pre-calibrated orthonormal basis, reducing activation dimensionality without altering trainable weights. The authors derive theoretical guidance for constructing the projection matrix via eigen-decomposition of the activation auto-correlation and related bounds, yielding six candidate projection schemes for each layer and enabling layer-wise selection. Empirically, ESPACE compresses GPT3, Llama2, and Nemotron4 models by up to ~50% with small perplexity increases (e.g., +0.18 on GPT3-22B) and reduces GEMM latency by up to ~46%; at moderate compression, perplexity occasionally improves. Compared to weight-centric tensor decompositions, ESPACE offers a practical, activation-based route to compression that preserves expressivity during retraining and exploits the associativity of matrix multiplication to compress weights at inference time, establishing a new direction in LLM tensor decomposition research.
Abstract
We propose ESPACE, an LLM compression technique based on dimensionality reduction of activations. Unlike prior works on weight-centric tensor decomposition, ESPACE projects activations onto a pre-calibrated set of principal components. The activation-centrality of the approach enables retraining LLMs with no loss of expressivity, while at inference time a weight decomposition is obtained as a byproduct of the associativity of matrix multiplication. Theoretical results on the construction of projection matrices with optimal computational accuracy are provided. Experimentally, we find ESPACE enables 50% compression of GPT3, Llama2, and Nemotron4 models with small accuracy degradation, as low as a 0.18 perplexity increase on GPT3-22B. At lower compression rates of 20% to 40%, ESPACE enables GPT3 models to outperform their baselines, with perplexity decreasing by up to 0.38 for GPT3-8B. ESPACE also reduces GEMM execution time and prefill inference latency on existing hardware. Comparison with related works on compressing Llama2-7B via matrix factorization shows that ESPACE is a first step in advancing the state-of-the-art in tensor decomposition compression of LLMs.
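The core idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the simplest construction, in which the projection matrix is formed from the top-k eigenvectors of the activation auto-correlation, and it uses synthetic, nearly low-rank activations. The function name `calibrate_projection` and all sizes are hypothetical. The sketch shows that W(P Pᵀx) can be regrouped as (W P)(Pᵀx), so the smaller matrix W P can be precomputed once and stored in place of W at inference.

```python
import numpy as np

def calibrate_projection(acts, k):
    """Return a (d, k) matrix with orthonormal columns.

    acts: calibration activations of shape (n_samples, d).
    Assumption: the basis is the top-k eigenvectors of the auto-correlation.
    """
    autocorr = acts.T @ acts / acts.shape[0]      # (d, d), symmetric
    eigvals, eigvecs = np.linalg.eigh(autocorr)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]           # indices of the k largest
    return eigvecs[:, top]

# Synthetic setup (illustrative sizes, not taken from the paper).
rng = np.random.default_rng(0)
d, d_out, k, n = 64, 32, 16, 2048
basis = rng.standard_normal((k, d))
# Activations that lie close to a k-dimensional subspace, plus small noise.
acts = rng.standard_normal((n, k)) @ basis + 0.01 * rng.standard_normal((n, d))
W = rng.standard_normal((d_out, d))               # original weight matrix

P = calibrate_projection(acts, k)                 # static, pre-calibrated basis
W_compressed = W @ P                              # (d_out, k), computed once offline

x = acts[:8]                                      # a batch of activations, (8, d)
y_ref = x @ W.T                                   # uncompressed layer output
y_proj = (x @ P) @ W_compressed.T                 # project first, then small GEMM
rel_err = np.linalg.norm(y_ref - y_proj) / np.linalg.norm(y_ref)
print(f"stored weights: {W.size} -> {W_compressed.size}, relative error {rel_err:.2e}")
```

In this toy setting the stored weight shrinks from d_out × d to d_out × k entries, at the cost of one extra projection of the activations. How well the output is preserved depends on how much of the activation energy the retained components capture.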
