Universal Properties of Activation Sparsity in Modern Large Language Models
Filip Szatkowski, Patryk Będkowski, Alessio Devoto, Jan Dubiński, Pasquale Minervini, Mikołaj Piórczyński, Simone Scardapane, Bartosz Wójcik
TL;DR
The paper addresses the gap in understanding activation sparsity in modern LLMs by introducing a simple, training-free framework built around a top-p sparsification rule and the notion of critical sparsity, the maximum sparsity at which a model retains at least $99\%$ of its original performance. It systematically evaluates sparsity tolerance across diverse model families (including GLU-based FFNs and diffusion LLMs), scales, and training regimes, and reveals universal patterns: larger models tolerate higher sparsity, input activations often drive sparsity more than gates or up-projections, and these sparsity dynamics persist across MoE and diffusion architectures. The findings indicate that activation sparsity is a robust, scale-dependent property with practical acceleration potential, especially under input-based sparsification, and that diffusion LLMs may tolerate even higher sparsity but require diffusion-specific strategies. Overall, the work offers practical guidelines for exploiting activation sparsity to improve the efficiency of large-scale LLMs without retraining, along with a reproducible framework applicable across architectures and tasks.
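The two notions named in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that top-p keeps the fewest largest-magnitude activations whose absolute values sum to at least a fraction p of the vector's L1 norm, and that critical sparsity is the highest evaluated sparsity whose score stays within a 1% relative drop of the dense baseline. The function names `top_p_sparsify`, `sparsity`, and `critical_sparsity`, and the `tolerance` parameter, are hypothetical.

```python
import numpy as np


def top_p_sparsify(x, p):
    """Assumed top-p rule: keep the fewest largest-magnitude entries whose
    absolute values sum to at least p * ||x||_1, and zero out the rest."""
    x = np.asarray(x, dtype=float)
    mags = np.abs(x)
    total = mags.sum()
    if total == 0.0:
        return x.copy()  # nothing to sparsify in an all-zero vector
    order = np.argsort(-mags, kind="stable")  # indices, largest magnitude first
    cum = np.cumsum(mags[order])
    # Smallest k with cum[k - 1] >= p * total (small slack for float rounding).
    k = int(np.searchsorted(cum, p * total - 1e-12, side="left")) + 1
    k = min(k, x.size)
    out = np.zeros_like(x)
    out[order[:k]] = x[order[:k]]
    return out


def sparsity(x):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(np.asarray(x) == 0))


def critical_sparsity(sparsities, scores, baseline, tolerance=0.01):
    """Highest evaluated sparsity whose score is at least (1 - tolerance) * baseline.
    Simplification: takes the maximum over all passing points and does not
    require the score curve to be monotone."""
    passing = [s for s, sc in zip(sparsities, scores)
               if sc >= (1.0 - tolerance) * baseline]
    return max(passing) if passing else 0.0


if __name__ == "__main__":
    acts = [4.0, -3.0, 2.0, 1.0]          # toy activation vector, L1 norm = 10
    sparse = top_p_sparsify(acts, p=0.7)  # keeps 4 and -3 (sum of magnitudes = 7)
    print(sparse, sparsity(sparse))
    # Toy sweep: made-up scores at increasing sparsity, for illustration only.
    print(critical_sparsity([0.0, 0.3, 0.5, 0.7], [0.80, 0.80, 0.795, 0.70], 0.80))
```

In this sketch p controls the retained activation mass rather than a fixed count, so the resulting sparsity varies per input vector; a sweep over p would produce the sparsity and score pairs passed to `critical_sparsity`.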
Abstract
Activation sparsity is an intriguing property of deep neural networks that has been extensively studied in ReLU-based models, due to its advantages for efficiency, robustness, and interpretability. However, methods relying on exact zero activations do not directly apply to modern Large Language Models (LLMs), leading to fragmented, model-specific strategies for LLM activation sparsity and a gap in its general understanding. In this work, we introduce a general framework for evaluating sparsity robustness in contemporary LLMs and conduct a systematic investigation of this phenomenon in their feedforward (FFN) layers. Our results uncover universal properties of activation sparsity across diverse model families and scales. Importantly, we observe that the potential for effective activation sparsity grows with model size, highlighting its increasing relevance as models scale. Furthermore, we present the first study of activation sparsity in diffusion-based LLMs. Overall, our work provides a comprehensive perspective and practical guidance for harnessing activation sparsity in LLM design and acceleration.
