Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Yixin Song; Haotong Xie; Zhengyan Zhang; Bo Wen; Li Ma; Zeyu Mi; Haibo Chen

TL;DR

The paper addresses the cost of dense LLM inference with a dReLU-based sparsification strategy that, combined with a diverse pretraining data mixture and the sparsity already present in MoE FFN experts, activates only a small subset of neurons per inference iteration. It shows that dReLU, which applies ReLU after both the gate and up projections, yields close to 90% neuron-level sparsity without sacrificing performance, and that the FFN experts in MoE models remain sparsely activated, so the method speeds up both dense and MoE architectures. On Mistral-7B and Mixtral-47B, the sparsified models achieve a 2-5× decoding speedup, and TurboSparse-Mixtral-47B reaches 11 tokens/s on mobile phones. The approach is evaluated on downstream tasks and benchmarks, where it outperforms several baselines while remaining practical to deploy on consumer hardware, and the authors release the sparsified TurboSparse models.

Abstract

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer.
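
The idea of applying ReLU to both the gate and up branches of a gated FFN can be illustrated with a small sketch. The code below is not the authors' implementation: the function names (drelu_ffn, sparsity), the weight layout (lists of rows), and the demo dimensions are assumptions made for illustration, and it uses plain Python lists rather than a tensor library. It computes hidden = ReLU(W_gate x) * ReLU(W_up x) and then applies the down projection using only the neurons whose hidden value is nonzero, which is where the compute savings would come from.

```python
import random

def relu(v):
    # Elementwise max(0, x)
    return [max(0.0, x) for x in v]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def drelu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN with ReLU on both branches (assumed dReLU form).

    W_gate, W_up: d_ff rows of length d_model.
    W_down: d_model rows of length d_ff.
    Returns (output, hidden) so the caller can inspect sparsity.
    """
    gate = relu(matvec(W_gate, x))
    up = relu(matvec(W_up, x))
    hidden = [g * u for g, u in zip(gate, up)]
    # A neuron is active only if both branches are positive;
    # inactive neurons are skipped in the down projection.
    active = [j for j, h in enumerate(hidden) if h != 0.0]
    out = [sum(row[j] * hidden[j] for j in active) for row in W_down]
    return out, hidden

def sparsity(hidden):
    # Fraction of hidden neurons that are exactly zero
    return sum(1 for h in hidden if h == 0.0) / len(hidden)

if __name__ == "__main__":
    # Hypothetical sizes and random weights, for illustration only
    random.seed(0)
    d_model, d_ff = 16, 256
    rand = lambda r, c: [[random.gauss(0, 1) for _ in range(c)] for _ in range(r)]
    x = [random.gauss(0, 1) for _ in range(d_model)]
    out, hidden = drelu_ffn(x, rand(d_ff, d_model), rand(d_ff, d_model), rand(d_model, d_ff))
    print(f"sparsity with random weights: {sparsity(hidden):.2f}")
```

With random weights each branch is positive about half the time, so the sparsity seen here is only a property of the toy setup; the near-90% figures reported in the paper come from trained models.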

Paper Structure (38 sections, 4 equations, 6 figures, 13 tables)

Introduction
Related Work and Background
Efficient Inference of LLMs.
Mixture-of-Experts (MoE).
Intrinsic Activation Sparsity.
Gated-MLP Blocks.
Analysis
Limitations about Existing ReLUfication
dReLU
Are Neurons in Expert still Sparsely Activated?
Models.
dReLU Sparsification
Experimental setup.
Pretraining datasets.
SFT datasets.
...and 23 more sections

Figures (6)

Figure 2: Example of dReLU sparsification. The left figure illustrates the original dense activation where every input activates all neurons, while the right is our sparsified LLMs, where each input activates only a small subset of neurons.
Figure 3: Post-activation distribution of ReLULlama and Llama-2-7B in layer 0.
Figure 4: Training loss of small models with different activation functions.
Figure 5: (a) Performance of MoE models with regard to activation sparsity. The impact of activation sparsity on the performance is negligible until the sparsity ratio is larger than 0.5. (b) Activation distribution of Mixtral and Mistral.
Figure 6: Sparsity of TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B of different layers.
...and 1 more figures
