ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs

Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, Maosong Sun

TL;DR

This work broadens sparse activation for LLM inference beyond zeros by introducing a magnitude-threshold definition of neuron activation and evaluating multiple activation functions. It systematically analyzes sparsity against performance, predictivity of inactive neurons, and hardware affinity, identifying ReLU$^2$ as the most effective function for sparse deployment. Key findings include high sparsity with minimal performance loss, strong predictivity, and substantial I/O and FLOP reductions, suggesting practical gains for low-resource inference. The authors also provide a thresholding method based on CETT and plan to release code to support future research.

Abstract

Sparse computation offers a compelling solution for the inference of Large Language Models (LLMs) in low-resource scenarios by dynamically skipping the computation of inactive neurons. While traditional approaches focus on ReLU-based LLMs, leveraging zeros in activation values, we broaden the scope of sparse LLMs beyond zero activation values. We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold, demonstrating that non-ReLU LLMs also exhibit sparse activation. To find the most efficient activation function for sparse computation, we propose a systematic framework to examine the sparsity of LLMs from three aspects: the trade-off between sparsity and performance, the predictivity of sparsity, and the hardware affinity. We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$. The results indicate that models employing ReLU$^2$ excel across all three evaluation aspects, highlighting its potential as an efficient activation function for sparse LLMs. We will release the code to facilitate future research.
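The magnitude-threshold idea in the abstract can be illustrated with a toy feed-forward layer. The sketch below is a minimal illustration, not the authors' implementation: it assumes each neuron's output is its activation value times its output weight vector, treats a neuron as inactive when the L2 norm of that output falls below a threshold, and measures the relative error from dropping the inactive neurons as one plausible reading of a truncation error. The names `neuron_outputs` and `sparsity_and_error` and all weights are hypothetical.

```python
import math

def relu2(x):
    """ReLU squared: max(x, 0) ** 2."""
    return max(x, 0.0) ** 2

def l2(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))

def neuron_outputs(x, w_in, w_out, act=relu2):
    """Per-neuron output vectors act(w_in[i] . x) * w_out[i] (assumed FFN form)."""
    outs = []
    for wi, wo in zip(w_in, w_out):
        a = act(sum(p * q for p, q in zip(wi, x)))
        outs.append([a * v for v in wo])
    return outs

def sparsity_and_error(outs, threshold):
    """Return (fraction of neurons below the magnitude threshold,
    relative L2 error of the layer output when those neurons are dropped)."""
    dim = len(outs[0])
    full = [sum(o[d] for o in outs) for d in range(dim)]
    dropped = [o for o in outs if l2(o) < threshold]
    tail = [sum(o[d] for o in dropped) for d in range(dim)]
    sparsity = len(dropped) / len(outs)
    error = l2(tail) / l2(full) if l2(full) > 0 else 0.0
    return sparsity, error

# Toy data (hypothetical): 4 neurons, 2-dimensional input and output.
x = [1.0, -1.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.0], [2.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]

outs = neuron_outputs(x, w_in, w_out)
sparsity, error = sparsity_and_error(outs, threshold=0.1)
print(sparsity, error)
```

In this toy case, half of the neurons fall below the threshold while the dropped outputs account for roughly 1% of the layer output norm, which is the kind of trade-off the threshold is meant to expose.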

Paper Structure

This paper contains 23 sections, 6 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: (a) Distribution of normalized output magnitudes of neurons in LLaMA. This distribution is long-tailed. (b) The average magnitude of neuron output representations in LLaMA with regard to the layer index. The average output magnitude grows with the layer index. (c) Cumulative errors of tail truncation with regard to activation sparsity. (d) Performance of LLaMA with regard to activation sparsity. The impact of activation sparsity on performance is negligible until the sparsity ratio exceeds $0.7$.
  • Figure 2: (a) Training loss of 1B models with different activation functions. (b) Performance dynamics of 1B models on evaluation datasets. When training tokens reach 100 billion, the performance of the models with SwiGLU, ReGLU and ReLU$^2$ is very close.
  • Figure 3: (a) Cumulative errors of tail truncation with regard to activation sparsity. With the same cumulative errors, the sparsity of ReLU$^2$ is much higher than that of other functions in most cases. (b) Performance of 1B models under different sparsity ratios. ReLU$^2$ achieves the best trade-off between performance and sparsity. The cumulative error of $0.2$ is an inflection point of model performance for all activation functions.
  • Figure 4: Activation sparsity of different LLaMAs with regard to the model scale. LLaMAs with SwiGLU have a similar tendency under different model scales while LLaMAs with ReGLU become sparser with the increase of model scale.
  • Figure 5: (a) Prediction recall of 1B models with the top-$k$ prediction strategy. For each token, the neurons with the top $20\%$ largest prediction scores are predicted to be active. Hence the prediction sparsity is fixed at $0.2$. (b) Prediction recall and (c) prediction sparsity of 1B models with the threshold-based prediction strategy. The neurons with the prediction scores larger than $0.5$ are predicted to be active. Under similar sparsity ratios, ReLU$^2$ has higher prediction recall and prediction sparsity than other activation functions.
  • ...and 4 more figures
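
The two prediction strategies named in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's predictor: the prediction scores are taken as given, recall is assumed to be the share of truly active neurons that are predicted active, and the function names and toy scores are hypothetical. The sketch reports the fraction of neurons predicted active directly rather than committing to a particular definition of prediction sparsity.

```python
def predict_top_k(scores, active_fraction=0.2):
    """Predict as active the neurons with the largest scores (top fraction)."""
    k = max(1, round(len(scores) * active_fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def predict_threshold(scores, threshold=0.5):
    """Predict as active every neuron whose score is larger than the threshold."""
    return {i for i, s in enumerate(scores) if s > threshold}

def recall(predicted, truly_active):
    """Share of truly active neurons that were predicted active."""
    if not truly_active:
        return 1.0
    return len(predicted & truly_active) / len(truly_active)

# Toy data (hypothetical): scores for 10 neurons, 3 of which are truly active.
scores = [0.9, 0.1, 0.6, 0.2, 0.05, 0.7, 0.3, 0.02, 0.4, 0.01]
truly_active = {0, 2, 5}

top_k = predict_top_k(scores)          # fixed budget: 20% of the neurons
by_thr = predict_threshold(scores)     # budget varies with the scores

print(sorted(top_k), recall(top_k, truly_active), len(top_k) / len(scores))
print(sorted(by_thr), recall(by_thr, truly_active), len(by_thr) / len(scores))
```

The top-$k$ strategy fixes how many neurons are computed regardless of the input, so it can miss active neurons when more than 20% are active. The threshold strategy lets the number of predicted-active neurons vary per token, which is why the caption reports both recall and sparsity for it.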