Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei
TL;DR
Q-Sparse enables large language models to operate with full activation sparsity via top-K sparsification and straight-through estimation, improving inference-time efficiency while maintaining competitive accuracy. The work extends sparsity to quantized and 1-bit weight regimes (e.g., BitNet b1.58) and introduces Block Q-Sparse for batch processing. A joint inference-optimal scaling law is proposed, combining a power law in model size $N$ and an exponential in sparsity ratio $S$, with an optimal $S^*$ of around 45.58% (full precision) or 61.25% (1.58-bit) under a fixed inference budget of $N_a$ activated parameters. Empirical results across training-from-scratch, continue-training, and supervised finetuning show that Q-Sparse matches or surpasses dense baselines with far fewer activated parameters, and that it combines with BitNet and MoE to improve the efficiency and reduce the energy consumption of future LLM deployments.
Abstract
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through estimator during training. We also introduce Block Q-Sparse for batch training and inference. The key results from this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) we present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). In particular, the synergy of BitNet b1.58 and Q-Sparse (which can also be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
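The two ingredients named in the abstract, top-K sparsification of activations and the straight-through estimator (STE), can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `topk_sparsify` and `ste_backward` are hypothetical, the activation is a plain list for a single token, and selecting the top K entries by absolute value is an assumption made for illustration.

```python
def topk_sparsify(x, k):
    """Forward pass: keep the k largest-magnitude entries of x, zero the rest.

    Returns (y, mask), where mask[i] is 1.0 for kept entries and 0.0 otherwise.
    Selecting by absolute value is an assumption of this sketch.
    """
    if not 0 < k <= len(x):
        raise ValueError("k must satisfy 0 < k <= len(x)")
    # Indices of the k entries with the largest magnitude.
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    mask = [0.0] * len(x)
    for i in keep:
        mask[i] = 1.0
    y = [xi * mi for xi, mi in zip(x, mask)]
    return y, mask


def ste_backward(grad_y):
    """Backward pass under the straight-through estimator.

    The sparsification step is treated as the identity, so the gradient
    with respect to x is the gradient with respect to y, unmasked.
    """
    return list(grad_y)


# Hypothetical 4-dimensional activation with K = 2 (50% activation sparsity).
x = [0.5, -2.0, 0.1, 1.5]
y, mask = topk_sparsify(x, k=2)       # y = [0.0, -2.0, 0.0, 1.5]
grad_x = ste_backward([1.0, 1.0, 1.0, 1.0])
```

Without the STE, the gradient reaching `x` would be multiplied by `mask`, so the zeroed entries would receive no gradient signal. Passing the gradient straight through keeps every entry trainable while the forward pass stays sparse.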
