BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei

TL;DR

BitNet is a 1-bit Transformer architecture for large language models. Its core component, BitLinear, replaces the nn.Linear layer and trains 1-bit weights from scratch, reducing memory and energy costs while retaining competitive performance. Training is quantization-aware: it uses a straight-through estimator for the non-differentiable binarization step, keeps optimizer states in high precision, and uses a large learning rate for stability and fast convergence. The loss follows a power-law scaling with model size, comparable to FP16 Transformers, and BitNet is more energy-efficient and performs better on downstream tasks than post-training quantization baselines. This work suggests that aggressive quantization, when coupled with careful training dynamics and group quantization and normalization for model parallelism, can enable efficient scaling to even larger LLMs while remaining competitive in accuracy.
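
The straight-through estimator mentioned above can be illustrated with a toy example. This is a minimal sketch in plain Python, not the authors' training code: the function names (`sign`, `ste_step`), the squared-error loss, and the single dot product standing in for a layer are all assumptions made for illustration. The forward pass uses the binarized weights, while the gradient is applied to the high-precision latent weights as if binarization were the identity.

```python
def sign(v):
    # Binarize to {-1, +1}; mapping zero to +1 is a convention chosen for this sketch.
    return 1.0 if v >= 0 else -1.0


def ste_step(latent, x, target, lr):
    """One SGD step on a binarized dot product using a straight-through estimator.

    `latent` holds high-precision weights; only their signs are used in the forward pass.
    Returns the updated latent weights and the loss measured before the update.
    """
    binary = [sign(w) for w in latent]                 # forward pass sees 1-bit weights
    pred = sum(b * xi for b, xi in zip(binary, x))
    err = pred - target                                # gradient of 0.5 * err**2 w.r.t. pred
    # Straight-through: treat d(sign(w))/dw as 1, so the gradient computed for the
    # binary weight is applied directly to the latent weight.
    new_latent = [w - lr * err * xi for w, xi in zip(latent, x)]
    return new_latent, 0.5 * err * err


latent = [0.3, -0.2, 0.1]
x = [1.0, 1.0, 1.0]
latent, loss_before = ste_step(latent, x, target=-1.0, lr=0.1)
latent, loss_after = ste_step(latent, x, target=-1.0, lr=0.1)
print(loss_before, loss_after)
```

In this toy run the small latent weight 0.1 drifts below zero after one update, its sign flips, and the loss measured on the second step drops. Keeping the latent weights in high precision is what lets many small updates accumulate into such a flip.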

Abstract

The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
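
To make the idea of a 1-bit linear layer concrete, the following is a minimal forward-pass sketch in plain Python (kept free of PyTorch so that it is self-contained). It is an illustration under stated assumptions, not the paper's BitLinear: the helper names `binarize_weights` and `bit_linear` are hypothetical, the zero-centering and mean-absolute-value scaling are assumed details, and activation quantization and normalization are omitted entirely.

```python
def binarize_weights(W):
    """Return 1-bit weights in {-1, +1} and a scalar scale for a weight matrix W.

    Assumed scheme for this sketch: zero-center by the mean, take the sign,
    and keep the mean absolute value as a single high-precision scale.
    """
    flat = [w for row in W for w in row]
    alpha = sum(flat) / len(flat)                   # mean, used to zero-center
    beta = sum(abs(w) for w in flat) / len(flat)    # mean absolute value, used to rescale
    W_bin = [[1.0 if w - alpha > 0 else -1.0 for w in row] for row in W]
    return W_bin, beta


def bit_linear(x, W):
    """Compute y = beta * (W_bin @ x) for one input vector x (no bias)."""
    W_bin, beta = binarize_weights(W)
    # With weights in {-1, +1}, each product is just an add or a subtract.
    return [beta * sum(wb * xi for wb, xi in zip(row, x)) for row in W_bin]


W = [[0.5, -0.3], [0.1, -0.7]]
print(binarize_weights(W)[0])      # binarized weight matrix
print(bit_linear([2.0, 1.0], W))   # rescaled output vector
```

Only the sign matrix and one scale per matrix need to be stored at inference time in this scheme, which is where the memory saving over FP16 weights would come from.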

Paper Structure

This paper contains 21 sections, 16 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: BitNet trains 1-bit Transformers from scratch, obtaining competitive results in an energy-efficient way. BitNet significantly outperforms state-of-the-art quantization methods. As the model size scales up, the cost savings become more significant while achieving competitive performance with the models trained with FP16.
  • Figure 2: (a) The computation flow of BitLinear. (b) The architecture of BitNet, consisting of stacked attention and FFN blocks, where matrix multiplication is implemented as BitLinear.
  • Figure 3: Scaling curves of BitNet and FP16 Transformers.
  • Figure 4: Zero-shot (Left) and few-shot (Right) performance of BitNet and FP16 Transformer against the inference cost.
  • Figure 5: BitNet is more stable than the FP16 Transformer at the same learning rate (Left). This training stability allows BitNet to use a larger learning rate, resulting in better convergence (Right).
  • ...and 1 more figure
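
The power-law scaling of loss with model size, mentioned in the TL;DR and plotted in Figure 3, can be checked numerically with a log-log least-squares fit. The sketch below is an assumption-laden illustration: it fits the simple form loss = a * size**b without an irreducible-loss term, the function name `fit_power_law` is hypothetical, and the data points are synthetic values generated from a known power law, not results from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**b in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope of the regression line in log-log space is the exponent b.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept in log space gives log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic points drawn from loss = 5 * size**-0.1, for illustration only.
sizes = [1e8, 1e9, 1e10]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(a, b)
```

On noise-free synthetic data the fit recovers the generating coefficient and exponent up to floating-point error. With real measurements, comparing the fitted exponents of two model families is one way to judge whether their scaling curves are "akin" in the sense used above.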