Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

Hongyu Wang; Shuming Ma; Ruiping Wang; Furu Wei

TL;DR

Q-Sparse enables all large language models to operate with full activation sparsity via top-K sparsification and straight-through estimation, yielding inference-time efficiency while maintaining competitive accuracy. The work extends sparsity to quantized and 1-bit weight regimes (e.g., BitNet b1.58) and introduces Block Q-Sparse for batch processing. A joint inference-optimal scaling law is proposed, combining a power-law in model size $N$ and an exponential in sparsity $S$, with an optimal $S^*$ around 45.58% (FP) or 61.25% (1.58-bit) under a fixed inference budget $N_a$, enabling substantial compute savings. Empirical results across training-from-scratch, continue-training, and supervised finetuning demonstrate that Q-Sparse matches or surpasses dense baselines with far fewer activated parameters, and it synergizes with BitNet and MoE to significantly improve efficiency and energy usage in future LLM deployments.
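As a rough illustration of what "a power-law in model size $N$ and an exponential in sparsity $S$" can look like, one functional form matching that description is sketched below. The constants $E$, $A(S)$, $B$, $C$, $\alpha$, $\beta$ and the exact shape of the exponential term are assumptions made for illustration and are not quoted from the summary above; the paper's scaling-law sections give the fitted form.

$$
L(N, S) = E + \frac{A(S)}{N^{\alpha}}, \qquad A(S) = B + C \exp\!\left(\frac{\beta}{1 - S}\right)
$$

Under a form like this, the inference-optimal sparsity $S^*$ is the value of $S$ that minimizes $L$ when the number of activated parameters $N_a$ is held fixed, so that a larger $S$ permits a larger total $N$.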

Abstract

We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and using the straight-through estimator during training. We also introduce Block Q-Sparse for batch training and inference. The key results of this work are: (1) Q-Sparse achieves results comparable to those of baseline LLMs while being much more efficient at inference time; (2) we present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). In particular, the synergy of BitNet b1.58 and Q-Sparse (which can also be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
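The two ingredients named in the abstract, top-K sparsification and the straight-through estimator (STE), can be sketched in a few lines of dependency-free Python. This is a minimal illustration and not the authors' implementation: selecting the top K entries by absolute magnitude is an assumption, and the function names (`topk_mask`, `sparsify_forward`, `sparsify_backward_ste`, `sparsify_backward_masked`) are hypothetical.

```python
def topk_mask(x, k):
    """Return a 0/1 mask that keeps the k entries of x with the largest
    absolute value (selection by magnitude is an assumption here)."""
    if not 0 < k <= len(x):
        raise ValueError("k must be in the range 1..len(x)")
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    mask = [0.0] * len(x)
    for i in keep:
        mask[i] = 1.0
    return mask


def sparsify_forward(x, k):
    """Forward pass: zero out every activation outside the top k."""
    mask = topk_mask(x, k)
    y = [xi * mi for xi, mi in zip(x, mask)]
    return y, mask


def sparsify_backward_ste(grad_y):
    """Backward pass with STE: treat the masking step as the identity,
    so the gradient reaches every input position unchanged."""
    return list(grad_y)


def sparsify_backward_masked(grad_y, mask):
    """Backward pass without STE, for comparison: the gradient is zeroed
    wherever the activation was dropped."""
    return [g * m for g, m in zip(grad_y, mask)]


# Usage: keep 2 of 4 activations (50% sparsity).
x = [0.5, -2.0, 0.1, 1.5]
y, mask = sparsify_forward(x, k=2)
print(y)                                          # [0.0, -2.0, 0.0, 1.5]
print(sparsify_backward_ste([1.0] * 4))           # [1.0, 1.0, 1.0, 1.0]
print(sparsify_backward_masked([1.0] * 4, mask))  # [0.0, 1.0, 0.0, 1.0]
```

The comparison between the two backward functions shows the design choice: without STE the dropped positions receive no gradient at all, whereas STE passes the upstream gradient through as if no masking had happened.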

Paper Structure (31 sections, 23 equations, 9 figures, 8 tables)

Fully Sparsely-Activated LLMs
Q-Sparse
Architecture
Training
Q-Sparse for Continue-Train and Finetuning Settings
Scaling Laws
Scaling Experiments and Findings
Power Law in the Model Size $N$
Exponential Law in the Sparsity Ratio $S$
Fitting the Parameters
Diminishing Gap between Sparsely-Activated Models and Dense Baselines
Inference-Optimal Scaling Law
Experiments
Training-from-Scratch
Setting
...and 16 more sections

Figures (9)

Figure 2: The average gradient magnitude of each projection for the dense baseline and for Q-Sparse with and without STE, across different layers. The visualization uses a 300M model on a subset of the C4 validation set. It shows that the gradient vanishes without STE.
Figure 3: The scaling curves of the sparsely-activated models with respect to the model size given a fixed sparsity ratio $S$ (Left), and with respect to the sparsity ratio given a fixed model size $N$ (Right).
Figure 4: The inference-optimal scaling curves of the sparsely-activated models with full-precision (Top) and 1.58-bit (Bottom) weights. It shows that a sparsity of 45.58% for full-precision models and 61.25% for 1.58-bit models achieves the best performance under the same inference compute budget (i.e., activated parameters or FLOPs).
Figure 5: The training loss curves of Q-Sparse and the baseline with full-precision weights. We set top-$K$ to 70% for Q-Sparse, resulting in 40% overall sparsity.
Figure 6: The training loss curves of Q-Sparse and the baseline with 1.58-bit weights. We set top-$K$ to 70% for Q-Sparse, resulting in 40% overall sparsity.
...and 4 more figures
