ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs
Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, Maosong Sun
TL;DR
This work extends sparse computation for LLM inference beyond exact-zero activations by defining neuron activation through a magnitude threshold and evaluating several activation functions under that definition. It analyzes sparsity from three aspects: the trade-off between sparsity and performance, the predictivity of sparsity, and hardware affinity. Among the functions tested, it identifies ReLU$^2$ as the most effective for sparse deployment. Key findings include high sparsity with minimal performance loss, strong predictivity, and substantial I/O and FLOP reductions, suggesting practical gains for low-resource inference. The authors also provide a thresholding method based on CETT (cumulative errors of tail truncation) and plan to release code to support future research.
Abstract
Sparse computation offers a compelling solution for the inference of Large Language Models (LLMs) in low-resource scenarios by dynamically skipping the computation of inactive neurons. While traditional approaches focus on ReLU-based LLMs, leveraging zeros in activation values, we broaden the scope of sparse LLMs beyond zero activation values. We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold, demonstrating that non-ReLU LLMs also exhibit sparse activation. To find the most efficient activation function for sparse computation, we propose a systematic framework to examine the sparsity of LLMs from three aspects: the trade-off between sparsity and performance, the predictivity of sparsity, and the hardware affinity. We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$. The results indicate that models employing ReLU$^2$ excel across all three evaluation aspects, highlighting its potential as an efficient activation function for sparse LLMs. We will release the code to facilitate future research.
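The magnitude-based definition of activation can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy feed-forward layer in which neuron $i$ contributes `act(w_in[i] . x) * w_out[i]` to the output, takes the L2 norm of that contribution as the neuron's output magnitude, and uses hand-picked thresholds rather than the paper's CETT-based threshold search. All names (`relu2`, `neuron_magnitudes`, `sparsity`) and all weights are hypothetical.

```python
import math


def relu2(x):
    """ReLU^2 activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2


def neuron_magnitudes(x, w_in, w_out, act=relu2):
    """Return the L2 norm of each neuron's contribution to the layer output.

    Assumption: neuron i contributes act(w_in[i] . x) * w_out[i], so its
    magnitude is |act(w_in[i] . x)| * ||w_out[i]||.
    """
    mags = []
    for wi, wo in zip(w_in, w_out):
        pre = sum(a * b for a, b in zip(wi, x))
        out_norm = math.sqrt(sum(v * v for v in wo))
        mags.append(abs(act(pre)) * out_norm)
    return mags


def sparsity(mags, threshold):
    """Fraction of neurons whose output magnitude is at or below the threshold."""
    inactive = sum(1 for m in mags if m <= threshold)
    return inactive / len(mags)


# Toy input and weights (illustrative values only).
x = [1.0, -2.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.0], [2.0, 0.5]]
w_out = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0], [0.0, 2.0]]

mags = neuron_magnitudes(x, w_in, w_out)
print(sparsity(mags, 0.0))  # zero-based definition: 0.25
print(sparsity(mags, 0.1))  # magnitude threshold:   0.5
```

With a threshold of zero, only the neuron whose activation is exactly zero counts as inactive. Raising the threshold to 0.1 also marks the neuron with a small but nonzero contribution as inactive, which is the sense in which the definition extends beyond zero activation values.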
