Importance-Aware Activation Space Reconstruction

Md Mokarram Chowdhury, Daniel Agyei Asante, Ernie Chang, Yang Li

TL;DR

This work tackles the challenge of deploying large language models in resource-limited environments by reframing compression from weights to activations with an importance-aware objective. The authors derive IMPACT, a principled framework whose core insight is that the optimal activation reconstruction directions are the top eigenvectors of an importance-weighted activation covariance matrix $\mathbf{C} = \mathrm{Cov}(\mathbf{y}) \odot \mathbf{M}$, where $\mathbf{M}$ encodes gradient-driven importance. By transforming activations, bounding the objective, and solving a tractable eigenproblem, IMPACT yields a closed-form, two-layer compressed representation $\hat{\mathbf{y}}$ that minimizes performance degradation. Empirically, IMPACT achieves up to 48.6% greater size reduction while maintaining accuracy across mathematical reasoning and code generation tasks on Llama 2 and CodeLlama models, and it can synergize with quantization to further improve efficiency and throughput. These results demonstrate a practical, scalable path to deploying capable transformers in constrained settings without sacrificing performance.

Abstract

Large language models (LLMs) achieve strong performance across many domains but are difficult to deploy in resource-constrained settings due to their size. Low-rank weight matrix compression is a popular strategy for reducing model size, typically by minimizing weight reconstruction error under the assumption that weights are low-rank. However, this assumption often does not hold in LLMs. Instead, LLM activations exhibit stronger low-rank structure, prompting a shift toward minimizing activation reconstruction error. We show that this shift alone is insufficient: activation dimensions contribute unequally to model performance, and uniform reconstruction can harm performance. We propose IMPACT, a principled framework for importance-aware activation reconstruction that links model compression decisions to their impact on model behavior. IMPACT formulates an optimization problem that considers both activation structure and gradient sensitivity, and derives a closed-form solution where the optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix. This enables low-rank approximations explicitly optimized to preserve accuracy. Experiments across diverse models and tasks show that IMPACT achieves up to 48.6% greater model size reduction with accuracy comparable to state-of-the-art baselines.
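
The core computation described above, taking the top eigenvectors of $\mathbf{C} = \mathrm{Cov}(\mathbf{y}) \odot \mathbf{M}$ and using them to split one linear layer into two smaller ones, can be illustrated with a minimal NumPy sketch. Several parts are assumptions that this summary does not confirm: the importance matrix $\mathbf{M}$ is taken to be the second-moment matrix of the loss gradients with respect to the activations, the reconstruction is a plain orthogonal projection onto the selected eigenvectors (the paper's Theorems 2 and 5 may define it differently), and the function names `importance_weighted_directions` and `factorize_linear` are hypothetical.

```python
import numpy as np


def importance_weighted_directions(Y, G, k):
    """Return the top-k eigenvectors of C = Cov(y) * M (elementwise product).

    Y: (n, d) output activations of one linear layer on calibration data.
    G: (n, d) gradients of the loss with respect to those activations.
    ASSUMPTION: M is the gradient second-moment matrix G^T G / n; the
    paper's exact definition of M is not given in this summary.
    """
    n = Y.shape[0]
    mu = Y.mean(axis=0)
    Yc = Y - mu
    cov = Yc.T @ Yc / n          # Cov(y), shape (d, d)
    M = G.T @ G / n              # assumed importance matrix, shape (d, d)
    C = cov * M                  # Hadamard product; symmetric PSD
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    return eigvecs[:, order], eigvals[order], mu


def factorize_linear(W, b, U, mu):
    """Rewrite y = W x + b as the two-layer form y_hat = U (A x) + c.

    ASSUMPTION: y_hat is the orthogonal projection of (y - mu) onto the
    columns of U, shifted back by mu.
    """
    A = U.T @ W                      # first layer, shape (k, d_in)
    c = U @ (U.T @ (b - mu)) + mu    # bias of the second layer, shape (d,)
    return A, c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_in, d, k = 512, 32, 16, 4
    W = rng.normal(size=(d, d_in))
    b = rng.normal(size=d)
    X = rng.normal(size=(n, d_in))
    Y = X @ W.T + b
    G = rng.normal(size=(n, d))      # stand-in for real gradients
    U, vals, mu = importance_weighted_directions(Y, G, k)
    A, c = factorize_linear(W, b, U, mu)
    Y_hat = (X @ A.T) @ U.T + c
    print("original parameters:", W.size, "factorized:", A.size + U.size)
    print("mean squared reconstruction error:", np.mean((Y - Y_hat) ** 2))
```

Under these assumptions, a layer with a $d \times d_{\text{in}}$ weight matrix is replaced by two matrices holding $k(d + d_{\text{in}})$ entries in total, so the factorization saves parameters only when $k < d\,d_{\text{in}} / (d + d_{\text{in}})$. Because both $\mathrm{Cov}(\mathbf{y})$ and the assumed $\mathbf{M}$ are positive semidefinite, their elementwise product is as well, which is why a symmetric eigensolver is appropriate here.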

Paper Structure

This paper contains 22 sections, 6 theorems, 56 equations, 10 figures, 10 tables, 1 algorithm.

Key Result

Theorem 1

Suppose the loss function $\ell$ is $C^1$-smooth, and the activation dimension is $d$. The objective function in Equation \ref{eq:eqn1} is upper bounded by:

Figures (10)

  • Figure 1: Normalized average gradient magnitudes across activation dimensions in Llama 2-7B on a mathematical reasoning task. For each linear layer, the output activation is a vector $\mathbf{y} \in \mathbb{R}^d$, where each component $y_i$ is referred to as an activation dimension. Dimensions are sorted in descending order by their normalized squared gradient magnitude, which indicates their relative contribution to the loss. The gradient magnitudes vary substantially across activation dimensions—a pattern also consistently observed in other models and tasks.
  • Figure 2: Pass@1 accuracy and model size of Llama 2-7B compressed with various low-rank algorithms on the mathematical reasoning task. Exact values are listed in Table \ref{tbl:7b-math} in Appendix \ref{sec:appendix2}.
  • Figure 3: Pass@1 accuracy and model size of Llama 2-13B compressed with various low-rank algorithms on the mathematical reasoning task. Exact values are listed in Table \ref{tbl:13b-math} in Appendix \ref{sec:appendix2}.
  • Figure 4: Pass@1 accuracy and model size of CodeLlama-7B compressed with various low-rank algorithms on the code generation task. Exact values are listed in Table \ref{tbl:7b-programming} in Appendix \ref{sec:appendix2}.
  • Figure 5: Pass@1 accuracy and model size of CodeLlama-13B compressed with various low-rank algorithms on the code generation task. Exact values are listed in Table \ref{tbl:13b-programming} in Appendix \ref{sec:appendix2}.
  • ...and 5 more figures

Theorems & Definitions (7)

  • Theorem 1: Bounding Theorem
  • Theorem 2: Activation Space Transformation Theorem
  • Theorem 3: Importance-Weighted Activation Covariance Matrix
  • Corollary 1
  • Theorem 4: Reconstruction Direction Theorem
  • Theorem 5: Activation Reconstruction Theorem
  • Definition 1: Differentiation Convention