Training-Free Activation Sparsity in Large Language Models
James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun
TL;DR
TEAL is a training-free activation-sparsity method that applies magnitude pruning to the hidden states feeding every weight matrix in modern LLMs. It reaches 40–50% model-wide sparsity with minimal accuracy degradation and, using optimized sparse kernels, yields wall-clock decoding speed-ups of up to $1.53\times$ at 40% sparsity and $1.8\times$ at 50% sparsity. The approach is validated on Llama-2, Llama-3, and Mistral models (7B–70B) and is compatible with weight quantization, which enables further efficiency gains. The analysis also covers a block-wise greedy sparsification strategy, hardware-aware acceleration, and how sparsity is distributed across attention and MLP components, pointing to practical use in edge and latency-constrained deployments.
Abstract
Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40–50% model-wide sparsity with minimal performance degradation across the Llama-2, Llama-3, and Mistral families, with sizes ranging from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to $1.53\times$ and $1.8\times$ at 40% and 50% model-wide sparsity, respectively. TEAL is compatible with weight quantization, enabling further efficiency gains.
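To make the idea concrete, the following is a minimal pure-Python sketch of magnitude-based activation sparsity, not TEAL's actual implementation or kernels. It assumes a single cutoff per tensor, chosen offline from a calibration sample so that a target fraction of entries falls below it. All function names (`magnitude_threshold`, `sparsify`, `sparse_matvec`) are hypothetical and chosen for illustration. The sketch also shows why zeroed activations save work: in $y = Wx$, any column of $W$ that is multiplied by a zero entry of $x$ never has to be read.

```python
import random


def magnitude_threshold(calibration, sparsity):
    """Pick a cutoff so that about `sparsity` of the calibration
    magnitudes fall strictly below it (illustrative assumption)."""
    mags = sorted(abs(v) for v in calibration)
    k = int(sparsity * len(mags))
    return mags[k] if k < len(mags) else float("inf")


def sparsify(x, threshold):
    """Zero every entry whose magnitude is below the cutoff."""
    return [v if abs(v) >= threshold else 0.0 for v in x]


def dense_matvec(W, x):
    """Reference product y = W x that touches every weight."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sparse_matvec(W, x):
    """Same product, but only weight columns with a nonzero
    activation are read."""
    active = [j for j, v in enumerate(x) if v != 0.0]
    return [sum(row[j] * x[j] for j in active) for row in W]


rng = random.Random(0)
d_in, d_out = 64, 8

# Stand-in for activations collected offline on calibration data.
calib = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
t = magnitude_threshold(calib, sparsity=0.5)

x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]

x_sparse = sparsify(x, t)
y_dense = dense_matvec(W, x_sparse)
y_sparse = sparse_matvec(W, x_sparse)
```

Here `y_dense` and `y_sparse` agree, while `sparse_matvec` reads only the columns of `W` whose activation survived the cutoff. A real GPU kernel would apply the same column skipping to reduce memory traffic during decoding, which is where the reported speed-ups come from.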
