ActTail: Global Activation Sparsity in Large Language Models

Wenwen Hou, Xinyuan Song, Shiwei Liu

Abstract

Activation sparsity is a promising approach for accelerating large language model (LLM) inference by reducing computation and memory movement. However, existing activation sparsity methods typically apply uniform sparsity across projections, ignoring the heterogeneous statistical properties of Transformer weights and thereby amplifying performance degradation. In this paper, we propose ActTail, a TopK magnitude-based activation sparsity method with global activation sparsity allocation grounded in Heavy-Tailed Self-Regularization (HT-SR) theory. Specifically, we capture this heterogeneity via the heavy-tail exponent computed from each projection's empirical spectral density (ESD), which is used as a quantitative indicator to assign projection-specific sparsity budgets. Importantly, we provide a theoretical analysis that establishes an explicit relationship between the activation sparsity ratio and the heavy-tail exponent under the HT-SR regime, offering principled guidance for sparsity allocation beyond heuristic design. Experiments on LLaMA and Mistral models show that our method improves both perplexity and downstream task performance at high sparsity compared to uniform allocation. At 80% sparsity, perplexity is reduced by 21.8% on LLaMA-2-7B, 40.1% on LLaMA-2-13B, and 9.4% on Mistral-7B.

Paper Structure

This paper contains 21 sections, 3 theorems, 33 equations, 5 figures, and 2 tables.

Key Result

Lemma 4.1

Let $d\ge 1$ and $p\in(2,\infty)$. For $y\in\mathbb{R}^d$, let $|y|_{(1)}\ge\cdots\ge |y|_{(d)}$ denote the decreasing rearrangement of $\{|y_1|,\ldots,|y_d|\}$. Let $T_K(y)$ be the TopK truncation that keeps the $K$ largest-magnitude coordinates of $y$ and sets the rest to zero. Then for any $K\in$ … where $C_p>0$ and $\|\cdot\|_{p,\infty}$ is the weak-$\ell_p$ (Lorentz) quasi-norm, $\|y\|_{p,\infty}:=$ …
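
The TopK truncation $T_K$ used in Lemma 4.1 can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation; the function names `topk_truncate` and `residual_l2` are hypothetical and chosen only for this example. The sketch keeps the $K$ largest-magnitude coordinates of $y$, zeroes the rest, and measures the $\ell_2$ error $\|y - T_K(y)\|_2$ that the lemma is concerned with.

```python
import math


def topk_truncate(y, k):
    """Return T_K(y): keep the k largest-magnitude entries of y, zero the rest.

    Hypothetical helper for illustration; ties are broken by original order.
    """
    if not 0 <= k <= len(y):
        raise ValueError("k must satisfy 0 <= k <= len(y)")
    # Indices sorted by decreasing magnitude; the first k are kept.
    order = sorted(range(len(y)), key=lambda i: abs(y[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(y)]


def residual_l2(y, k):
    """Return the l2 norm of y - T_K(y), i.e. the energy removed by truncation."""
    truncated = topk_truncate(y, k)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, truncated)))


y = [0.5, -3.0, 2.0, 0.1]
print(topk_truncate(y, 2))  # the two largest-magnitude entries survive
print(residual_l2(y, 2))    # l2 norm of the discarded entries 0.5 and 0.1
```

The residual depends only on the $d-K$ smallest magnitudes, which is why a bound on the decreasing rearrangement $|y|_{(k)}$ controls the truncation error.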

Figures (5)

  • Figure 1: Empirical spectral density (ESD) of the MLP gate projection and the attention output projection in Layer 1 of the LLaMA-2-13B model.
  • Figure 2: Overview of the ActTail workflow. ActTail first estimates heavy-tailed exponents from the empirical spectral density (ESD) of each projection, maps them to projection-level sparsity ratios via a linear rule, and applies TopK activation sparsity during inference.
  • Figure 3: Power-law exponents $\alpha$ (Hill estimator) of the ESDs for different modules across layers in LLaMA-2-7B. Smaller $\alpha$ indicates heavier tails.
  • Figure 4: Alpha and sparsity per projection module across layers.
  • Figure 5: Alpha values and sparsity ratios per projection, ordered by layer and module type (Q, K, V, O, Gate, Up, Down).
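
The workflow described in Figures 2 and 3 (estimate $\alpha$ per projection with a Hill estimator, then map $\alpha$ to a sparsity ratio with a linear rule) can be sketched as follows. This is a sketch under stated assumptions, not the paper's exact procedure: the Hill estimator is written in the form commonly used in the HT-SR literature, $\alpha = 1 + k / \sum_{i=1}^{k}\ln(\lambda_i/\lambda_{k+1})$ over the $k$ largest eigenvalues; the tail size `k`, the `spread` parameter, the choice to give heavier-tailed (smaller $\alpha$) projections lower sparsity, and the centering on the target ratio are all assumptions made for illustration. The function names are hypothetical.

```python
import math


def hill_alpha(eigenvalues, k):
    """Hill estimate of the power-law exponent from the k largest eigenvalues.

    Assumed form: alpha = 1 + k / sum_{i<=k} ln(lambda_i / lambda_{k+1}),
    with eigenvalues sorted in decreasing order.
    """
    lam = sorted(eigenvalues, reverse=True)
    if not 1 <= k < len(lam):
        raise ValueError("k must satisfy 1 <= k < len(eigenvalues)")
    threshold = lam[k]  # lambda_{k+1} in 1-based notation
    if threshold <= 0:
        raise ValueError("tail threshold eigenvalue must be positive")
    log_sum = sum(math.log(v / threshold) for v in lam[:k])
    if log_sum == 0:
        raise ValueError("tail eigenvalues are all equal; alpha is undefined")
    return 1.0 + k / log_sum


def allocate_sparsity(alphas, target, spread=0.1):
    """Map per-projection alphas linearly to sparsity ratios (assumed rule).

    Smaller alpha (heavier tail) receives lower sparsity. Ratios span a band
    of width `spread` and are centered so their plain mean equals `target`
    before clipping to [0, 1]. Projection sizes are ignored in this sketch.
    """
    lo, hi = min(alphas), max(alphas)
    if hi == lo:
        return [target] * len(alphas)
    scaled = [(a - lo) / (hi - lo) for a in alphas]  # each in [0, 1]
    mean_scaled = sum(scaled) / len(scaled)
    ratios = [target + spread * (s - mean_scaled) for s in scaled]
    return [min(max(r, 0.0), 1.0) for r in ratios]


# Toy spectrum: eigenvalues e^3, e^2, e^1, 1 with the top k = 3 as the tail.
alpha = hill_alpha([math.exp(3), math.exp(2), math.exp(1), 1.0], k=3)
print(alpha)

# Toy allocation for three projections around an 80% global target.
print(allocate_sparsity([2.0, 3.0, 4.0], target=0.8, spread=0.1))
```

In practice the eigenvalues would come from each projection's weight matrix (for example, the squared singular values of $W$), and a size-weighted average would be needed to hit an exact global budget; both steps are omitted here.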

Theorems & Definitions (6)

  • Lemma 4.1: $\ell_2$ Best $K$-term approximation
  • proof
  • Lemma 4.2: power-law tail integral
  • proof
  • Theorem 4.4: Spectral energy concentration
  • proof