Universal Properties of Activation Sparsity in Modern Large Language Models

Filip Szatkowski, Patryk Będkowski, Alessio Devoto, Jan Dubiński, Pasquale Minervini, Mikołaj Piórczyński, Simone Scardapane, Bartosz Wójcik

TL;DR

The paper addresses the gap in understanding activation sparsity in modern LLMs by introducing a simple, training-free framework built around a top-p sparsification rule and the notion of critical sparsity, the maximum sparsity at which a model retains at least $99\%$ of its dense performance. It systematically evaluates sparsity tolerance across diverse model families (including GLU-based FFNs and diffusion LLMs), scales, and training regimes, revealing universal patterns: larger models tolerate higher sparsity, input activations often drive sparsity more than gates or up-projections, and sparsity dynamics persist across MoEs and diffusion architectures. The findings highlight that activation sparsity is a robust, scale-dependent property with practical acceleration potential, especially when using input-based sparsification, and that diffusion LLMs may exhibit even stronger sparsity tolerance but require diffusion-specific strategies. Overall, the work provides practical guidelines for leveraging activation sparsity to improve efficiency in large-scale LLMs without retraining, along with a reproducible framework applicable across architectures and tasks.
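To make the top-p sparsification rule mentioned above more concrete, here is a minimal sketch in plain Python. It rests on an assumption that this summary does not confirm: that top-p keeps the largest-magnitude entries of an activation vector until they cover a fraction p of the total absolute magnitude, and zeroes the rest. The names `top_p_sparsify` and `sparsity` are hypothetical and are not taken from the paper's code.

```python
from typing import List


def top_p_sparsify(x: List[float], p: float) -> List[float]:
    """Keep the largest-magnitude entries of x until they cover a fraction p
    of the total absolute magnitude; set every other entry to zero.

    ASSUMPTION: this magnitude-mass reading of "top-p" is illustrative and
    may differ from the exact rule used in the paper.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    total = sum(abs(v) for v in x)
    if total == 0.0 or p == 0.0:
        return [0.0] * len(x)
    # Visit indices from largest to smallest magnitude.
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    out = [0.0] * len(x)
    covered = 0.0
    for i in order:
        out[i] = x[i]
        covered += abs(x[i])
        if covered >= p * total:
            break
    return out


def sparsity(x: List[float]) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in x if v == 0.0) / len(x)


if __name__ == "__main__":
    activations = [4.0, -0.5, 0.25, -3.0, 0.25]  # made-up values
    sparse = top_p_sparsify(activations, p=0.8)
    print(sparse)            # [4.0, 0.0, 0.0, -3.0, 0.0]
    print(sparsity(sparse))  # 0.6
```

Under this reading, a single threshold p induces a different number of kept entries for each input vector, so the resulting sparsity adapts to how concentrated the activation magnitudes are.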

Abstract

Activation sparsity is an intriguing property of deep neural networks that has been extensively studied in ReLU-based models, due to its advantages for efficiency, robustness, and interpretability. However, methods relying on exact zero activations do not directly apply to modern Large Language Models (LLMs), leading to fragmented, model-specific strategies for LLM activation sparsity and a gap in its general understanding. In this work, we introduce a general framework for evaluating sparsity robustness in contemporary LLMs and conduct a systematic investigation of this phenomenon in their feedforward (FFN) layers. Our results uncover universal properties of activation sparsity across diverse model families and scales. Importantly, we observe that the potential for effective activation sparsity grows with model size, highlighting its increasing relevance as models scale. Furthermore, we present the first study of activation sparsity in diffusion-based LLMs. Overall, our work provides a comprehensive perspective and practical guidance for harnessing activation sparsity in LLM design and acceleration.

Paper Structure

This paper contains 36 sections, 9 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Common strategies for exploiting activation sparsity to skip redundant computations in GLU-based FFN modules, with the origins of the sparse activation masks denoted with red borders. Input-based methods skip parts of the matrix multiplications corresponding to the low-magnitude components in the input vectors across all three linear layers. Gate-based and predictor-based methods instead omit computations associated with values that are either negligible in the gate activation vector or predicted to be negligible by an auxiliary predictor module.
  • Figure 2: Average accuracy across downstream tasks with different induced activation sparsity for base Gemma3 models. We normalize the accuracy by the original performance of the dense models, and denote the highest (critical) sparsity where at least 99% performance is retained with a marker.
  • Figure 3: Activation sparsity becomes more pronounced as model size increases. a) Average critical sparsity of FFN components across models, with least-squares trend lines. Larger models generally tolerate higher sparsity, suggesting greater potential benefits. b) Effective ranks (Roy and Vetterli, 2007) of activations on Winogrande, normalized by activation dimension. Larger models show lower effective dimensions, indicating greater redundancy available for sparsification. c) Critical sparsity under All-Inputs sparsification and the corresponding top-p thresholds at which performance degrades. Results are averaged across evaluation tasks, with marker size indicating model size.
  • Figure 4: Critical sparsity for Gemma3 models (1B, 4B, 12B, and 27B) across all evaluated modules and tasks. Marker size represents model scale, and tasks with higher accuracy are positioned toward the top. While tasks with higher baseline accuracy generally tolerate sparser activations, critical sparsity varies substantially across tasks, highlighting that activation sparsity is task-dependent.
  • Figure 5: Activation sparsity is a prevalent property across different model types. a) Critical sparsity levels for pretrained and instruction-tuned Gemma3 models. b) Performance of instruction-tuned and reasoning variants of Qwen3-4B on generative tasks assessing general knowledge, mathematics, and factual accuracy, with critical sparsity indicated by the markers. c) Activation sparsity in LLaMA-8B and the diffusion-based LLaDA-8B, with critical sparsity similarly marked. We report normalized accuracy, defined as the accuracy of the sparsified model divided by the accuracy of the dense model.
  • ...and 13 more figures
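
Two quantities from the captions above can be sketched in a few lines of plain Python: the critical sparsity marked in Figure 2 (the highest sparsity at which at least 99% of dense accuracy is retained) and the effective rank used in Figure 3b. This is an illustrative sketch, not the authors' implementation. The function names are hypothetical, the accuracy values are made up, and the effective rank follows the standard definition as the exponential of the Shannon entropy of the normalized singular values, computed here from singular values that are assumed to be already available.

```python
import math
from typing import Optional, Sequence, Tuple


def critical_sparsity(
    points: Sequence[Tuple[float, float]],
    dense_accuracy: float,
    retain: float = 0.99,
) -> Optional[float]:
    """Return the highest measured sparsity whose accuracy is at least
    `retain` times the dense accuracy, or None if no level qualifies.

    `points` holds (sparsity, accuracy) pairs from a sweep.
    ASSUMPTION: only measured levels are considered, with no interpolation.
    """
    if dense_accuracy <= 0.0:
        raise ValueError("dense_accuracy must be positive")
    ok = [s for s, acc in points if acc / dense_accuracy >= retain]
    return max(ok) if ok else None


def effective_rank(singular_values: Sequence[float]) -> float:
    """Effective rank: exp of the Shannon entropy of the singular values
    after normalizing them to sum to one."""
    total = sum(singular_values)
    if total <= 0.0:
        raise ValueError("singular values must have a positive sum")
    probs = [s / total for s in singular_values if s > 0.0]
    entropy = -sum(q * math.log(q) for q in probs)
    return math.exp(entropy)


if __name__ == "__main__":
    # Made-up sweep: (induced sparsity, average task accuracy).
    sweep = [(0.0, 0.70), (0.2, 0.70), (0.4, 0.695), (0.6, 0.68), (0.8, 0.50)]
    print(critical_sparsity(sweep, dense_accuracy=0.70))  # 0.4

    # Made-up singular values; dividing by the dimension gives the
    # normalized effective rank described in the Figure 3b caption.
    sv = [3.0, 1.0, 0.5, 0.1]
    print(effective_rank(sv) / len(sv))
```

Restricting the search to measured sparsity levels keeps the sketch simple; a finer sweep or interpolation between levels would give a more precise estimate of the critical point.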