Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity

Di Zhang, Xun Wu, Shaohan Huang, Yudong Wang, Hanyong Shao, Yingbo Hao, Zewen Chi, Li Dong, Ting Song, Yan Xia, Zhifang Sui, Furu Wei

TL;DR

This work proposes Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while, for the first time, ensuring stable training. Its results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs.

Abstract

Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently exhibits smaller performance degradation than full-precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code available at https://github.com/AAzdi/Sparse-BitNet
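
To make the two ingredients concrete, the following is a minimal sketch of ternary (1.58-bit) quantization combined with an N:M magnitude mask, where N:M means keeping N of every M consecutive weights (so 2:4 is 50% sparsity). It assumes the absmean quantizer from the BitNet b1.58 line of work and a simple largest-magnitude selection rule. The function names, and the choice to compute the mask on latent weights before applying it to the ternary values, are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: names and the mask/quantize ordering are assumptions,
# not taken from the Sparse-BitNet code base.

def absmean_ternary(weights):
    """Quantize floats to {-1, 0, +1} using an absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, 1e-8)  # guard against an all-zero input
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def nm_mask(weights, n, m):
    """Return a 0/1 mask keeping the n largest-magnitude entries per group of m."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Stable sort: on ties, the lower index is kept first.
        keep = set(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def sparse_ternary(weights, n, m):
    """Apply an N:M mask (from latent magnitudes) to ternary-quantized weights."""
    mask = nm_mask(weights, n, m)
    ternary, scale = absmean_ternary(weights)
    return [q * k for q, k in zip(ternary, mask)], scale


if __name__ == "__main__":
    latent = [0.9, -0.1, 0.05, -1.2, 0.3, 0.02, -0.7, 0.4]
    print(sparse_ternary(latent, 2, 4))
```

Because the ternary quantizer already maps small-magnitude weights to zero, many of the entries an N:M mask would drop are zero anyway, which is one intuitive reading of why the abstract describes the two techniques as naturally compatible.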

Paper Structure (23 sections, 6 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Intrinsic Sparsity in 1.58-bit BitNet. We present the aggregated weight statistics averaged across all linear layers of the pre-trained 1.58-bit BitNet (2B) model [ma2025bitnetb158]. (A) The distribution of normalized latent weights exhibits a distinct quantization-valley structure, where the majority of values fall within the [-0.5, 0.5] rounding interval. (B) Consequently, the quantized discrete states are dominated by zeros (approx. 42.3%), confirming that BitNet naturally converges to a highly sparse representation without explicit pruning.
  • Figure 2: Comparison of N:M sparsity friendliness between 1.58-bit BitNet and full-precision models. Normalized PPL increase relative to each method's dense (8:8) counterpart. The dashed line marks a 10% degradation threshold. At 2:4 (50% sparsity; same ratio as 4:8), BF16 exceeds the threshold (+18.8%) while BitNet remains below it (+5.7%), indicating 1.58-bit BitNet is more sparsity-friendly than full-precision models.
  • Figure 3: Ablation on training design choices for dynamic $6{:}8$ sparsity (Qwen2.5-0.5B). (a) Validation perplexity curves under different training design choices. (b) Corresponding mask flip rate $r_t$ (Eq. \ref{flip_rate}), reflecting the stability of sparsity pattern evolution during training.
  • Figure 4: Polarization trend under ternary QAT. (a) BF16 maintains a concentration around zero. (b) BitNet shows decreasing near-zero mass, indicating strong polarization over time.
  • Figure 5: Weight Distribution. Global histogram of linear-layer master weights at the final checkpoint. Unlike the unimodal distribution of BF16, BitNet displays a structured, multi-modal magnitude landscape, confirming the intrinsic sparsity property.
  • ...and 1 more figure
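
The mask flip rate $r_t$ in Figure 3 is defined by an equation in the paper that is not reproduced in this summary. As a rough sketch under one plausible reading (the fraction of mask entries whose keep/drop decision changes between two consecutive training steps), it could be computed as follows. The function name and this exact definition are assumptions for illustration.

```python
# Illustrative sketch: assumes r_t is the fraction of mask entries that differ
# between step t-1 and step t. The paper's own equation may differ.

def mask_flip_rate(prev_mask, curr_mask):
    """Fraction of 0/1 mask entries that changed between two steps."""
    if len(prev_mask) != len(curr_mask) or not prev_mask:
        raise ValueError("masks must be non-empty and of equal length")
    flips = sum(1 for a, b in zip(prev_mask, curr_mask) if a != b)
    return flips / len(prev_mask)


if __name__ == "__main__":
    step_prev = [1, 0, 0, 1]  # 2:4 mask at step t-1
    step_curr = [1, 0, 1, 0]  # 2:4 mask at step t
    print(mask_flip_rate(step_prev, step_curr))
```

Under this reading, a rate near zero means the sparsity pattern has settled, while a persistently high rate indicates that the mask keeps changing during training.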