ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun

TL;DR

ProSparse addresses the lack of intrinsic activation sparsity in non-ReLU LLMs with a three-step ReLUfication pipeline: substituting the activation function with ReLU, applying progressively scheduled $L_1$ regularization, and shifting the activation threshold with FATReLU. It reaches high activation sparsity (roughly 87–89%) on the open-source models LLaMA2-7B, LLaMA2-13B, and MiniCPM-1B while keeping downstream performance comparable to the original models, and it outperforms prior ReLU-based baselines. The work also demonstrates practical inference acceleration with both an approximate method (PowerInfer) and accurate GPU-based methods, with speedups of up to 4.52$\times$. Analyses of sparsity dynamics, regularization scheduling, SFT on sparse models, and dataset-wise and layer-wise sparsity distributions provide guidance for deployment and for future hardware-software co-design. Overall, ProSparse offers a scalable, controllable route to exploiting activation sparsity for efficient LLM inference on open models.
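To make the first and third steps of the pipeline concrete, the following is a minimal sketch, not the authors' implementation. It assumes FATReLU is a ReLU with a positive threshold (inputs below the threshold are zeroed) and that activation sparsity is measured as the fraction of exactly-zero activation outputs. The function names, the `>=` boundary convention, the toy inputs, and the threshold value `0.05` are all illustrative assumptions.

```python
from typing import Sequence


def relu(x: float) -> float:
    """Standard ReLU: negative inputs become zero."""
    return x if x > 0.0 else 0.0


def fatrelu(x: float, threshold: float) -> float:
    """Thresholded ReLU (assumed form of FATReLU): inputs below
    `threshold` are zeroed, all other inputs pass through unchanged."""
    return x if x >= threshold else 0.0


def sparsity(activations: Sequence[float]) -> float:
    """Fraction of activation outputs that are exactly zero."""
    if not activations:
        raise ValueError("activations must be non-empty")
    return sum(1 for a in activations if a == 0.0) / len(activations)


# Hypothetical pre-activation values for one token in one FFN layer.
pre_activations = [-1.2, 0.005, 0.3, -0.4, 0.02, 1.5, -0.01, 0.0]

relu_out = [relu(x) for x in pre_activations]
fat_out = [fatrelu(x, 0.05) for x in pre_activations]

print(f"ReLU sparsity:    {sparsity(relu_out):.2f}")
print(f"FATReLU sparsity: {sparsity(fat_out):.2f}")
```

In this toy example, raising the threshold prunes the small positive activations that plain ReLU lets through, which is the intuition behind using threshold shifting as a final step to push sparsity higher.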

Abstract

Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs. As a prevalent property of the models using the ReLU activation function, activation sparsity has been proven a promising paradigm to boost model inference efficiency. Nevertheless, most large language models (LLMs) adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored introducing ReLU or its variants as the substitutive activation function to help LLMs achieve activation sparsity and inference acceleration, but few can simultaneously obtain high sparsity and comparable model performance. This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity while maintaining comparable performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse adopts progressive sparsity regularization with a factor smoothly increasing along the multi-stage sine curves. This can enhance activation sparsity and mitigate performance degradation by avoiding radical shifts in activation distributions. With ProSparse, we obtain high sparsity of 89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size MiniCPM-1B, respectively, achieving comparable performance to their original Swish-activated versions. These present the most sparsely activated models among open-source LLaMA versions and competitive end-size models, considerably surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference acceleration experiments further demonstrate the significant practical acceleration potential of LLMs with higher activation sparsity, obtaining up to 4.52$\times$ inference speedup.
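The progressive regularization described above can be sketched as follows. This is one plausible reading of "a factor smoothly increasing along the multi-stage sine curves", not the paper's exact formula: it assumes a constant factor during a first warmup stage and, in each later stage, a half-period sine ramp from the previous stage's factor to the current one. The names `progressive_factor` and `l1_penalty`, the stage boundaries, and the factor values are hypothetical.

```python
import math
from typing import Sequence


def progressive_factor(
    step: int,
    stage_ends: Sequence[int],
    stage_factors: Sequence[float],
) -> float:
    """Regularization factor at `step` under an assumed multi-stage sine schedule.

    stage_ends:    last step of each stage, strictly increasing.
    stage_factors: target factor reached at the end of each stage.
    """
    if not stage_ends or len(stage_ends) != len(stage_factors):
        raise ValueError("stage_ends and stage_factors must match and be non-empty")
    # Stage 1 (assumed warmup): hold the factor constant.
    if step <= stage_ends[0]:
        return stage_factors[0]
    # Later stages: ramp smoothly from the previous factor to the current one.
    for i in range(1, len(stage_ends)):
        if step <= stage_ends[i]:
            start, end = stage_ends[i - 1], stage_ends[i]
            low, high = stage_factors[i - 1], stage_factors[i]
            progress = (step - start) / (end - start)
            # sin goes from -1 to 1 over the stage, so ramp goes from 0 to 1.
            ramp = 0.5 * (math.sin(math.pi * progress - math.pi / 2) + 1.0)
            return low + (high - low) * ramp
    # After the final stage, keep the last factor.
    return stage_factors[-1]


def l1_penalty(activations: Sequence[float], factor: float) -> float:
    """Scaled mean absolute value of the activation outputs."""
    return factor * sum(abs(a) for a in activations) / len(activations)


# Hypothetical three-stage schedule.
ends = [100, 200, 300]
factors = [0.1, 0.5, 1.0]
for s in (50, 150, 200, 250, 300):
    print(s, round(progressive_factor(s, ends, factors), 4))
```

Because the sine ramp has zero slope at both ends of each stage, the factor changes slowly around stage boundaries. This matches the stated motivation of avoiding radical shifts in the activation distribution. In training, the penalty would be added to the language modeling loss.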

Paper Structure

This paper contains 45 sections, 3 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: The overall architecture of ProSparse, which includes three steps: activation function substitution, progressive sparsity regularization, and activation threshold shifting.
  • Figure 2: The trend of sparsity (7B models) along the training process. "Shifted" denotes Shifted ReLU, and $b=0.1$ corresponds to the results in Table \ref{tab:baseline}.
  • Figure 3: The activation sparsity obtained by applying different final-stage regularization factors $\lambda_S$ to the checkpoints at different training stages (16,500 steps in total) of ProSparse-7B.
  • Figure 4: The layer-wise sparsity of ProSparse models. The marker "$^*$" denotes the settings without activation threshold shifting.