The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

TL;DR

The paper shows that activation maps in Transformer MLPs become highly sparse after training, that larger models are sparser, and that the phenomenon appears at all layer depths across NLP and vision tasks, even when training on random labels, random inputs, or an unlimited supply of data. It argues that sparsity emerges largely from training dynamics rather than from properties of the data, supported by a theoretical analysis of gradients at initialization. The authors discuss practical benefits, including potential FLOP reductions, and introduce Top-k Transformers that enforce sparsity, reaching comparable accuracy while improving robustness and calibration. They also discuss broader implications for efficiency, hardware support for sparse computation, and the idea that Transformers are inherently parsimonious models.

Abstract

This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by sparse we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depths, as well as for other architectures including MLP-mixers and 2-layer MLPs. We show that sparsity also emerges using training datasets with random labels, or with random inputs, or with an infinite amount of data, demonstrating that sparsity is not a result of a specific family of datasets. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.
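
The quantities named in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names (`nonzero_percentage`, `top_k_threshold`, `sparse_second_layer`) and the toy sizes are assumptions made for illustration. It shows how the percentage of nonzero entries in a post-ReLU activation map might be measured, how Top-k thresholding keeps only the k largest entries, and why a sparse activation lets the following linear layer skip the weights attached to zero entries.

```python
import random


def relu(values):
    """Elementwise ReLU on a list of floats."""
    return [v if v > 0.0 else 0.0 for v in values]


def nonzero_percentage(activation):
    """Percentage of entries in one activation map that are nonzero."""
    nonzero = sum(1 for v in activation if v != 0.0)
    return 100.0 * nonzero / len(activation)


def top_k_threshold(activation, k):
    """Keep the k largest entries and set all others to zero (hypothetical helper)."""
    order = sorted(range(len(activation)), key=lambda i: activation[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(activation)]


def sparse_second_layer(weight_columns, activation):
    """Compute sum_i activation[i] * weight_columns[i], skipping zero activations.

    weight_columns[i] is the output-weight vector attached to hidden neuron i,
    so only the weights of active neurons are ever read.
    """
    out = [0.0] * len(weight_columns[0])
    for a, column in zip(activation, weight_columns):
        if a == 0.0:
            continue  # no multiply-adds are spent on inactive neurons
        for j, w in enumerate(column):
            out[j] += a * w
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    d_ff, d_out = 64, 4  # toy sizes, chosen only for this example
    pre_activation = [rng.gauss(0.0, 1.0) for _ in range(d_ff)]
    weights = [[rng.gauss(0.0, 1.0) for _ in range(d_out)] for _ in range(d_ff)]

    activation = relu(pre_activation)
    sparse_activation = top_k_threshold(activation, k=4)
    print("nonzero % after ReLU: ", nonzero_percentage(activation))
    print("nonzero % after Top-k:", nonzero_percentage(sparse_activation))
    print("output:", sparse_second_layer(weights, sparse_activation))
```

A randomly initialized toy layer like this one will not reproduce the sparsity levels the paper reports for trained models; the sketch only makes the measurement and the thresholding step concrete.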

Paper Structure

This paper contains 33 sections, 2 theorems, 27 equations, 25 figures, 4 tables.

Key Result

Theorem 3.1

Let $f(\boldsymbol{x}; \boldsymbol{V}, \boldsymbol\theta): \mathbb{R}^n \to \mathbb{R}^K$ be a neural network given by $f(\boldsymbol{x}; \boldsymbol{V}, \boldsymbol\theta) = \boldsymbol{V} \sigma(\boldsymbol{p}(\boldsymbol{x}; \boldsymbol\theta))$, where $\boldsymbol{V} = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_{d_\text{ff}}] \in \mathbb{R}^{K \times d_\text{ff}}$ is the network parameter for the last layer, drawn from a random distribution, $\sigma(\cdot)$ is the ReLU activation function, and $\boldsymbol{p}(\boldsymbol{x}; \boldsymbol\theta)$ ...

Figures (25)

  • Figure 1: Percentage of nonzero entries (y-axis, log scale) in the activation map as a function of number of training steps (x-axis) for a T5-Base model trained with the span corruption objective on the C4 dataset. Left: layers (from shallow to deep) of the encoder. Right: layers of the decoder.
  • Figure 2: Percentage of nonzero entries across different layers of trained Transformers (a) for both language data with T5 and vision data with ViT, (b) on both training and evaluation data, (c) for ViT trained on two ImageNet datasets of different scales (21k vs 1k classes), (d) on ViT of varying configurations, and (e, f) on T5 of varying configurations. Note that the y-axis is in log scale. Sparsity emerges in all cases.
  • Figure 3: Percentage of times that each neuron in the first MLP layer of a trained T5 is activated on the C4 dataset.
  • Figure 4: Activation sparsity across different encoder layers of trained T5 Transformers of (a) varying depth and (b, c) varying width (i.e., $d_\text{ff}$). Since the dimension of the activation maps changes with width, we evaluate sparsity both in terms of the percentage (as in (b)) and the count (as in (c)) of nonzeros. Deeper and wider models are sparser in terms of the percentage of activated neurons.
  • Figure 5: Percentage of nonzero entries in ViT trained on ImageNet-21k (IM-21K) with (a) random labels, where $p\%$ of the labels are replaced by labels drawn from a uniform distribution, with $p \in \{50, 70, 100\}$, (b) random images, where each image is replaced by one whose pixels are drawn i.i.d. from a uniform distribution on $[-1, 1]$, and (c) infinite data, where sufficient training data is generated by drawing random image and random label pairs so that the model is never trained on the same pair twice.
  • ...and 20 more figures
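
The random-label and random-image conditions described for Figure 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' data pipeline: the helper names `randomize_labels` and `random_image`, the nested-list image layout, and the choice to corrupt an exact rounded fraction of the labels are all hypothetical.

```python
import random


def randomize_labels(labels, num_classes, fraction, rng):
    """Replace a given fraction of labels with uniform draws over all classes.

    Assumption: exactly round(fraction * len(labels)) positions are corrupted.
    A replaced label may coincide with the original one by chance.
    """
    n_replace = round(fraction * len(labels))
    chosen = rng.sample(range(len(labels)), n_replace)
    noisy = list(labels)
    for i in chosen:
        noisy[i] = rng.randrange(num_classes)
    return noisy


def random_image(height, width, channels, rng):
    """Image whose pixels are drawn i.i.d. from a uniform distribution on [-1, 1]."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(channels)] for _ in range(width)]
        for _ in range(height)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    labels = list(range(10))
    print(randomize_labels(labels, num_classes=10, fraction=0.5, rng=rng))
    print(random_image(2, 2, 3, rng)[0][0])
```

Drawing a fresh `random_image` and a fresh uniform label for every training step would correspond to the "infinite data" condition in panel (c), since no pair is ever repeated.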

Theorems & Definitions (4)

  • Theorem 3.1
  • Proof of Theorem 3.1
  • Lemma D.1
  • Proof of Lemma D.1