Training-Free Activation Sparsity in Large Language Models

James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun

TL;DR

TEAL is a training-free activation-sparsity method that applies magnitude-based pruning to hidden states throughout modern LLMs. It achieves 40–50% model-wide sparsity with minimal accuracy degradation and, using optimized sparse kernels, delivers wall-clock decoding speed-ups of up to $1.53\times$ at 40% sparsity and $1.8\times$ at 50% sparsity. The approach is validated across Llama-2, Llama-3, and Mistral models (7B–70B) and is compatible with weight quantization, which enables further efficiency gains. The paper also covers a block-wise greedy sparsification strategy, hardware-aware acceleration, and an analysis of how sparsity is distributed across attention and MLP components, pointing to practical potential for edge and latency-constrained deployments.

Abstract

Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40–50% model-wide sparsity with minimal performance degradation across the Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to $1.53\times$ and $1.8\times$ at 40% and 50% model-wide sparsity, respectively. TEAL is compatible with weight quantization, enabling further efficiency gains.

Paper Structure

This paper contains 32 sections, 3 theorems, 16 equations, 11 figures, 10 tables, 1 algorithm.

Key Result

Lemma A.1

For independent random normal variables $X \sim N(0, \sigma_X^2), W \sim N(0,\sigma^2_W)$ and sparsification function $s_{t_p}(\cdot)$, the variance of $(X-s_{t_p}(X))W$ is given by: where $\varphi(t) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}t^2}$ is the probability density function of the standard normal distribution.
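
The displayed expression is not reproduced above. As a hedged sketch of what it should evaluate to, assume (neither definition appears in this excerpt) that $s_{t_p}$ zeroes every entry with $|x| \le t_p$ and leaves the rest unchanged, and that the threshold $t_p$ is chosen so that $P(|X| \le t_p) = p$. Under those assumptions the variance follows from a standard integration by parts; the form printed in the paper may be arranged differently.

$$
\begin{aligned}
X - s_{t_p}(X) &= X\,\mathbf{1}\{|X| \le t_p\}, \qquad \mathbb{E}\big[(X - s_{t_p}(X))W\big] = 0 \ \text{ since } W \text{ is independent with mean zero},\\
\operatorname{Var}\big((X - s_{t_p}(X))W\big) &= \mathbb{E}\big[X^2\,\mathbf{1}\{|X| \le t_p\}\big]\;\mathbb{E}\big[W^2\big]
 = \sigma_W^2\,\sigma_X^2 \int_{-a}^{a} z^2 \varphi(z)\,dz, \qquad a = \frac{t_p}{\sigma_X},\\
\int_{-a}^{a} z^2 \varphi(z)\,dz &= \Big[-z\,\varphi(z)\Big]_{-a}^{a} + \int_{-a}^{a} \varphi(z)\,dz = p - 2a\,\varphi(a) \qquad \text{(using } \varphi'(z) = -z\,\varphi(z)\text{)},\\
\operatorname{Var}\big((X - s_{t_p}(X))W\big) &= \sigma_X^2\,\sigma_W^2 \left(p - \frac{2\,t_p}{\sigma_X}\,\varphi\!\left(\frac{t_p}{\sigma_X}\right)\right).
\end{aligned}
$$

As a sanity check on this sketch, the expression tends to $0$ as $p \to 0$ (nothing is pruned, so there is no error) and to $\sigma_X^2 \sigma_W^2$ as $p \to 1$ (everything is pruned, so the error is the full product $XW$).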

Figures (11)

  • Figure 1: Overview of TEAL. During decoding, TEAL thresholds low-magnitude activation entries to zero, which obviates the need to move the associated weight channels onto the registers, thus enabling wall-clock speed-ups.
  • Figure 2: Activation distributions of Llama-3-8B's four hidden states at Blocks 8, 16, and 24. The activations preceding the Attention and MLP blocks typically exhibit Gaussian-like shapes, while intermediate states within these blocks exhibit Laplacian-like shapes. The best-fit Gaussian/Laplace distributions are overlaid in blue.
  • Figure 3: Latency vs. sparsity for matrix-vector multiplication ($1\times4096$ by $4096\times14336$), comparing TEAL to Deja Vu. 'Theoretical Optimal' shows the latency reduction for torch.matmul assuming perfect linear scaling with sparsity.
  • Figure 4: Perplexity vs. sparsity for Llama-2-7B quantized to various bitwidths on WikiText. Left: Performance over sparsity levels. Right: Performance normalized by bitwidth.
  • Figure 5: Layer-level activation error for $\mathbf{W}_\text{up}$ at Block 16 of Llama-3-8B: TEAL utilizing input sparsity, and CATS utilizing output sparsity.
  • ...and 6 more figures
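
The mechanism in the Figure 1 caption (threshold low-magnitude activations to zero, then skip the weight channels they would have multiplied) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's kernel: the function names `calibrate_threshold`, `sparsify`, and `sparse_matvec` are hypothetical, the quantile-based threshold calibration on sample activations is an assumption, and the random Gaussian data stands in for real hidden states.

```python
import numpy as np


def calibrate_threshold(calib_acts: np.ndarray, sparsity: float) -> float:
    """Pick t so that roughly `sparsity` of |activations| fall at or below t.

    Assumption: the threshold is an empirical quantile of calibration data.
    """
    return float(np.quantile(np.abs(calib_acts), sparsity))


def sparsify(x: np.ndarray, threshold: float) -> np.ndarray:
    """Zero every entry whose magnitude is at or below the threshold."""
    return np.where(np.abs(x) > threshold, x, 0.0)


def sparse_matvec(W: np.ndarray, x_sparse: np.ndarray) -> np.ndarray:
    """Compute W @ x_sparse while reading only the columns of W that pair
    with nonzero activations (W has shape [d_out, d_in] in this sketch)."""
    active = np.flatnonzero(x_sparse)
    return W[:, active] @ x_sparse[active]


rng = np.random.default_rng(0)
d_in, d_out = 512, 1024
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)

# Hypothetical calibration activations; real hidden states would be used instead.
calib = rng.standard_normal((256, d_in))
t = calibrate_threshold(calib, sparsity=0.5)

x = rng.standard_normal(d_in)
x_s = sparsify(x, t)
y_sparse = sparse_matvec(W, x_s)
y_dense = W @ x

achieved = float(np.mean(x_s == 0.0))
rel_err = float(np.linalg.norm(y_dense - y_sparse) / np.linalg.norm(y_dense))
print(f"threshold={t:.3f}  achieved sparsity={achieved:.2f}  relative error={rel_err:.3f}")
```

NumPy's fancy indexing copies the selected columns, so this sketch shows which weights are touched rather than delivering a real speed-up; the wall-clock gains reported for TEAL come from specialized GPU kernels that avoid loading the skipped channels in the first place.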

Theorems & Definitions (8)

  • Definition 1
  • Definition A.1
  • Lemma A.1: Variance of Scalar Sparsified Error
  • proof
  • Lemma A.2: Expected $\ell_2$ Norm of Sparsified Matrix-Vector Error
  • proof
  • Theorem A.1: Distributional Relative Error
  • proof