ActTail: Global Activation Sparsity in Large Language Models

Wenwen Hou; Xinyuan Song; Shiwei Liu

Abstract

Activation sparsity is a promising approach for accelerating large language model (LLM) inference by reducing computation and memory movement. However, existing activation sparsity methods typically apply uniform sparsity across projections, ignoring the heterogeneous statistical properties of Transformer weights and thereby amplifying performance degradation. In this paper, we propose ActTail, a TopK magnitude-based activation sparsity method with global activation sparsity allocation grounded in Heavy-Tailed Self-Regularization (HT-SR) theory. Specifically, we capture this heterogeneity via the heavy-tail exponent computed from each projection's empirical spectral density (ESD), which is used as a quantitative indicator to assign projection-specific sparsity budgets. Importantly, we provide a theoretical analysis that establishes an explicit relationship between the activation sparsity ratio and the heavy-tail exponent under the HT-SR regime, offering principled guidance for sparsity allocation beyond heuristic design. Experiments on LLaMA and Mistral models show that our method improves both perplexity and downstream task performance at high sparsity compared to uniform allocation. At 80% sparsity, perplexity is reduced by 21.8% on LLaMA-2-7B, 40.1% on LLaMA-2-13B, and 9.4% on Mistral-7B.

Paper Structure (21 sections, 3 theorems, 33 equations, 5 figures, 2 tables)

Introduction
Related Work
Activation Sparsity in LLMs.
Heavy-Tailed Self-Regularization Theory.
Methodology
Notation
TopK Input Activation Sparsity during Inference
Heavy-Tailed Self-Regularization Theory (HT-SR)
Motivation: from HT-SR theory to activation sparsity.
Heavy-tailed Metric.
ActTail: Global Activation Sparsity Allocation
Theoretical Analysis
Results
Experimental Settings
Models and Evaluation.
...and 6 more sections

Key Result

Lemma 4.1

Let $d\ge 1$ and $p\in(2,\infty)$. For $y\in\mathbb{R}^d$, let $|y|_{(1)}\ge\cdots\ge |y|_{(d)}$ denote the decreasing rearrangement of $\{|y_1|,\ldots,|y_d|\}$. Let $T_K(y)$ be the TopK truncation that keeps the $K$ largest-magnitude coordinates of $y$ and sets the rest to zero. Then for any $K\in\ldots$ where $C_p>0$. Here $\|\cdot\|_{p,\infty}$ is the weak-$\ell_p$ (Lorentz) quasi-norm $\|y\|_{p,\infty}:=\ldots$
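To make the two objects named in the lemma concrete, here is a minimal Python sketch of the TopK truncation $T_K(y)$ and of a weak-$\ell_p$ quasi-norm. The function names are hypothetical, and the quasi-norm uses the standard textbook form $\max_{1\le k\le d} k^{1/p}\,|y|_{(k)}$, which is an assumption here because the definition above is incomplete. The sketch only computes the quantities involved; it does not reproduce the lemma's bound.

```python
import math


def topk_truncate(y, k):
    """T_K(y): keep the k largest-magnitude coordinates of y, zero the rest."""
    if not 0 <= k <= len(y):
        raise ValueError("k must lie in [0, len(y)]")
    # Indices ordered by decreasing magnitude; sorted() is stable, so ties
    # are broken by original position.
    order = sorted(range(len(y)), key=lambda i: -abs(y[i]))
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(y)]


def weak_lp_quasinorm(y, p):
    """Assumed standard weak-l_p quasi-norm: max_k k^(1/p) * |y|_(k)."""
    mags = sorted((abs(v) for v in y), reverse=True)  # decreasing rearrangement
    return max(((k ** (1.0 / p)) * m for k, m in enumerate(mags, start=1)),
               default=0.0)


def l2_truncation_error(y, k):
    """Euclidean norm of y - T_K(y), i.e. the energy in the dropped tail."""
    t = topk_truncate(y, k)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, t)))


if __name__ == "__main__":
    y = [3.0, -4.0, 0.5, 1.0]
    print(topk_truncate(y, 2))
    print(l2_truncation_error(y, 2))
    print(weak_lp_quasinorm(y, 3.0))
```

For `y = [3.0, -4.0, 0.5, 1.0]` and `k = 2`, the truncation keeps `3.0` and `-4.0`, and the dropped tail has squared energy `0.5**2 + 1.0**2 = 1.25`.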

Figures (5)

Figure 1: Empirical spectral density (ESD) of the MLP gate projection and the attention output projection in Layer 1 of the LLaMA-2-13B model.
Figure 2: Overview of the ActTail workflow. ActTail first estimates heavy-tailed exponents from the empirical spectral density (ESD) of each projection, maps them to projection-level sparsity ratios via a linear rule, and applies TopK activation sparsity during inference.
Figure 3: Power-law exponents $\alpha$ (Hill estimator) of the ESDs for different modules across layers in LLaMA-2-7B. Smaller $\alpha$ indicates heavier tails.
Figure 4: Alpha and sparsity per projection module across layers.
Figure 5: Alpha values and sparsity ratios per projection, ordered by layer and module type (Q, K, V, O, Gate, Up, Down).
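The workflow described in the Figure 2 and Figure 3 captions (estimate $\alpha$ per projection, map it linearly to a sparsity ratio, then apply TopK) can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the Hill estimator is written in a form common in the HT-SR literature, $\alpha = 1 + k / \sum_{i=1}^{k} \ln(\lambda_{n-i+1}/\lambda_{n-k})$; the eigenvalues are assumed to be precomputed (for example, from $W^\top W$ for each projection weight $W$); and the names `hill_alpha`, `allocate_sparsity`, and `keep_count`, the `spread` parameter, and the direction of the linear rule are all hypothetical, since the captions do not specify them.

```python
import math


def hill_alpha(eigs, k=None):
    """Hill estimate of the power-law exponent from ESD eigenvalues.

    Assumed form: alpha = 1 + k / sum_{i=1..k} ln(lam[n-i+1] / lam[n-k]),
    with eigenvalues sorted in ascending order (1-based indices).
    """
    lam = sorted(e for e in eigs if e > 0)
    n = len(lam)
    if n < 2:
        raise ValueError("need at least two positive eigenvalues")
    if k is None:
        k = n // 2  # assumption: use the upper half of the spectrum
    k = max(1, min(k, n - 1))
    ref = lam[n - k - 1]  # lam_{n-k} in 1-based notation
    log_sum = sum(math.log(x / ref) for x in lam[n - k:])
    if log_sum <= 0:
        raise ValueError("degenerate tail: top eigenvalues are all equal")
    return 1.0 + k / log_sum


def allocate_sparsity(alphas, target, spread=0.1, higher_alpha_more_sparse=True):
    """Linear map from alphas to per-projection sparsity ratios.

    The unweighted mean of the result equals `target` (before clipping), and
    no ratio deviates from `target` by more than `spread`. The direction flag
    is an assumption; the captions do not state which way the rule points.
    """
    mean = sum(alphas) / len(alphas)
    max_dev = max(abs(a - mean) for a in alphas)
    if max_dev == 0:
        return [target] * len(alphas)
    sign = 1.0 if higher_alpha_more_sparse else -1.0
    scale = sign * spread / max_dev
    return [min(1.0, max(0.0, target + scale * (a - mean))) for a in alphas]


def keep_count(d, sparsity):
    """Number of activations TopK keeps for input width d at this sparsity."""
    return int(round((1.0 - sparsity) * d))


if __name__ == "__main__":
    alphas = [2.0, 3.0, 4.0]  # hypothetical per-projection exponents
    ratios = allocate_sparsity(alphas, target=0.8, spread=0.1)
    print(ratios, [keep_count(4096, s) for s in ratios])
```

Centering on the mean keeps the average sparsity at the global target when all projections are weighted equally; weighting by projection size would be a natural variation if the budget is meant to be counted in activations rather than in projections.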

Theorems & Definitions (6)

Lemma 4.1: $\ell_2$ Best $K$-term approximation
proof
Lemma 4.2: power-law tail integral
proof
Theorem 4.4: Spectral energy concentration
proof
