Bi-Mamba: Towards Accurate 1-Bit State Space Models
Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen
TL;DR
Bi-Mamba presents a scalable 1-bit binarization of the Mamba State Space Model that preserves competitive performance while retaining Mamba's linear-time complexity in sequence length. By binarizing only the dominant linear modules via FBI-Linear and training with autoregressive distillation from a high-quality teacher, Bi-Mamba achieves strong zero-shot and perplexity results at the 780M, 1.3B, and 2.7B parameter scales. The approach yields substantial reductions in memory, storage, and energy, with favorable results in multilingual and instruction-tuned settings, suggesting practical viability for resource-constrained deployment. This work establishes a foundational approach for low-bit, linearly-scaling LLMs and motivates hardware optimized for 1-bit Mamba-based architectures.
Abstract
The typical Selective State-Space Model (SSM) used in Mamba addresses several limitations of Transformers, such as the quadratic computational complexity with respect to sequence length and the significant memory requirements during inference due to the key-value (KV) cache. However, the increasing size of Mamba models continues to pose challenges for training and deployment, particularly due to their substantial computational demands during both training and inference. In this work, we introduce $\texttt{Bi-Mamba}$, a scalable and powerful 1-bit Mamba architecture designed to enable more efficient large language models (LLMs), with model sizes of 780M, 1.3B, and 2.7B parameters. $\texttt{Bi-Mamba}$ models are trained from scratch on a standard LLM-scale dataset using an autoregressive distillation loss. Extensive experiments on language modeling benchmarks demonstrate that $\texttt{Bi-Mamba}$ achieves performance comparable to its full-precision (FP16 or BF16) counterparts, while outperforming post-training binarization (PTB) Mamba and binarization-aware training (BAT) Transformer baselines. Moreover, $\texttt{Bi-Mamba}$ drastically reduces memory usage and computational cost compared to the original Mamba. Our work pioneers a new line of linear-complexity LLMs under low-bit representation and paves the way for the design of specialized hardware optimized for efficient 1-bit Mamba-based models. Code and pre-trained weights are available at https://github.com/Tangshengku/Bi-Mamba.
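To make the two ingredients named above more concrete, the following is a minimal NumPy sketch of (a) approximating a linear layer's weight with 1-bit signs plus per-row scale and shift, and (b) a token-level distillation loss between teacher and student next-token distributions. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the scale and shift are computed here in closed form (row mean and mean absolute deviation) rather than learned, and the exact formulation of FBI-Linear and of the distillation loss in the released code may differ.

```python
import numpy as np

def binarize_linear_weight(w):
    """Approximate w of shape (out_features, in_features) as alpha * signs + mu.

    Assumption: per-row closed-form scale/shift; a trained 1-bit layer
    would typically learn these parameters instead.
    """
    mu = w.mean(axis=1, keepdims=True)            # per-row shift
    centered = w - mu
    signs = np.where(centered >= 0, 1.0, -1.0)    # the 1-bit part, in {-1, +1}
    alpha = np.abs(centered).mean(axis=1, keepdims=True)  # per-row scale
    return signs, alpha, mu

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def autoregressive_distillation_loss(student_logits, teacher_logits):
    """Mean cross-entropy between teacher and student next-token distributions.

    Both inputs have shape (sequence_length, vocab_size): one row of
    logits per predicted position.
    """
    teacher_probs = np.exp(log_softmax(teacher_logits))
    student_logp = log_softmax(student_logits)
    return float(-(teacher_probs * student_logp).sum(axis=-1).mean())

# Usage on random data (shapes are arbitrary, chosen for illustration).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
signs, alpha, mu = binarize_linear_weight(w)
w_hat = alpha * signs + mu                        # dequantized approximation of w
loss = autoregressive_distillation_loss(rng.normal(size=(5, 16)),
                                        rng.normal(size=(5, 16)))
```

In this sketch only `signs` needs one bit per weight; `alpha` and `mu` add one floating-point value each per output row, which is why the storage cost of a large linear layer is dominated by the binary matrix.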
