Bi-Mamba: Towards Accurate 1-Bit State Space Models

Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen

TL;DR

Bi-Mamba presents a scalable 1-bit binarization of the Mamba State Space Model, preserving competitive performance while achieving linear-time complexity in sequence processing. By binarizing only the dominant linear modules via FBI-Linear and training with autoregressive distillation from a high-quality teacher, Bi-Mamba achieves strong zero-shot and perplexity results across 780M, 1.3B, and 2.7B parameter scales. The approach yields substantial reductions in memory, storage, and energy, with favorable results on multilingual and instruction-tuned settings, suggesting practical viability for resource-constrained deployment. This work establishes a foundational approach for low-bit, linearly-scaling LLMs and motivates hardware optimized for 1-bit Mamba-based architectures.
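To make the idea of a binarized linear module concrete, here is a minimal sketch of a forward pass in which each weight is reduced to its sign and then rescaled. The summary above does not spell out the internals of FBI-Linear, so the per-output-column `scale` and `shift` factors, and the function names `binarize` and `binary_linear`, are assumptions made for illustration and are not taken from the paper. Training such a layer (for example, how gradients pass through the sign function) is not covered here.

```python
# Sketch only: sign-binarized weights with a per-output-column scale and shift.
# `binarize`, `binary_linear`, `scale`, and `shift` are hypothetical names.

def binarize(weights):
    """Map every weight to +1.0 or -1.0 by its sign (zero maps to +1.0)."""
    return [[1.0 if w >= 0 else -1.0 for w in row] for row in weights]


def binary_linear(x, weights, scale, shift):
    """Compute y_j = sum_i x_i * (scale_j * sign(W_ij) + shift_j).

    x       : input features, length n_in
    weights : n_in x n_out full-precision latent weights
    scale   : per-output-column scaling factors, length n_out (assumed)
    shift   : per-output-column offsets, length n_out (assumed)
    """
    signs = binarize(weights)
    out = []
    for j in range(len(scale)):
        acc = 0.0
        for i, xi in enumerate(x):
            acc += xi * (scale[j] * signs[i][j] + shift[j])
        out.append(acc)
    return out


x = [1.0, 2.0]
W = [[0.3, -0.7],
     [-0.1, 0.9]]
y = binary_linear(x, W, scale=[0.5, 2.0], shift=[0.0, 0.0])
```

Only the signs need to be stored at one bit per weight; the scale and shift vectors add a small full-precision overhead per output column.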

Abstract

The typical Selective State-Space Model (SSM) used in Mamba addresses several limitations of Transformers, such as the quadratic computational complexity with respect to sequence length and the significant memory requirements during inference due to the key-value (KV) cache. However, the increasing size of Mamba models continues to pose challenges for training and deployment, particularly due to their substantial computational demands during both training and inference. In this work, we introduce $\texttt{Bi-Mamba}$, a scalable and powerful 1-bit Mamba architecture designed to enable more efficient large language models (LLMs), with model sizes of 780M, 1.3B, and 2.7B parameters. $\texttt{Bi-Mamba}$ models are trained from scratch on a standard LLM-scale dataset using an autoregressive distillation loss. Extensive experiments on language modeling benchmarks demonstrate that $\texttt{Bi-Mamba}$ achieves performance comparable to its full-precision (FP16 or BF16) counterparts, while outperforming post-training binarization (PTB) Mamba and binarization-aware training (BAT) Transformer baselines. Moreover, $\texttt{Bi-Mamba}$ drastically reduces memory usage and computational cost compared to the original Mamba. Our work pioneers a new line of linear-complexity LLMs under low-bit representation and paves the way for the design of specialized hardware optimized for efficient 1-bit Mamba-based models. Code and the pre-trained weights are available at https://github.com/Tangshengku/Bi-Mamba.
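The abstract names an autoregressive distillation loss without giving its form. One common reading, sketched below, is the cross-entropy between the teacher's and the student's next-token distributions, averaged over sequence positions. That exact form is an assumption, and the names `softmax` and `distillation_loss` are hypothetical helpers rather than code from the Bi-Mamba repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits):
    """Mean over positions of the cross-entropy H(teacher, student).

    Both arguments are lists of per-position logit vectors
    (sequence_length x vocabulary_size). Assumed form, not the paper's code.
    """
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_teacher = softmax(t_row)
        p_student = softmax(s_row)
        total -= sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    return total / len(student_logits)


teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 3.0]]
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss([[0.0, 2.0, -1.0], [3.0, 0.5, 0.5]], teacher)
```

For a fixed teacher, this loss is smallest when the student reproduces the teacher's distribution at every position, so `matched` is lower than `mismatched`.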

Paper Structure

This paper contains 21 sections, 6 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Perplexity comparison of Bi-Mamba, GPTQ, and Bi-LLM on the Wiki2, PTB, and C4 datasets. GPTQ and Bi-LLM degrade significantly at low bit-widths. Bi-Mamba achieves low perplexity at 1 bit, with performance similar to GPTQ-8bit.
  • Figure 2: Illustration of the Bi-Mamba framework. Our Bi-Mamba binarizes both input and output projection matrices. Compared with the post-binarization method (Bi-LLM), our binarization-aware training method (Bi-Mamba) generates a more similar weight distribution (after scaling) on each part.
  • Figure 3: Comparison of results on Mamba-2 at the 2.7B, 1.3B, and 780M scales.
  • Figure 4: Training dynamics of the downstream performance and perplexity of Bi-Mamba.
  • Figure 5: Downstream performance and perplexity curves of Bi-Mamba-2.7B at different training costs.
  • ...and 8 more figures