The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
TL;DR
This paper introduces BitNet b1.58, a 1.58-bit LLM with ternary weights {-1, 0, 1} that can match FP16/BF16 Transformer performance at the same model size and training tokens while substantially reducing latency, memory footprint, and energy use and increasing throughput. The approach constrains weights with absmean quantization, scales activations per token, adopts LLaMA-like components for open-source compatibility, and replaces standard linear layers with BitLinear. Empirical results show BitNet b1.58 reaches perplexity parity with FP16 baselines from roughly 3B parameters onward and delivers substantial efficiency gains, including up to 4.1x faster inference at 70B, about 71x lower arithmetic energy for matrix multiplication, and support for up to 11x larger batch sizes at 70B with correspondingly higher throughput. Training with 2T tokens further demonstrates strong generalization, with BitNet b1.58 outperforming StableLM-3B across multiple end tasks. The work outlines a path toward dedicated hardware designs and future extensions (e.g., 1-bit MoEs, long-sequence handling, edge/mobile deployment) that could enable cost-effective, scalable LLMs and new computing paradigms.
Abstract
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
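To make the ternary constraint concrete, the following is a minimal sketch of absmean weight quantization, assuming the usual formulation: scale the weight matrix by its mean absolute value, then round each entry to the nearest value in {-1, 0, 1}. The function name `absmean_quantize`, the `eps` default, and the toy matrix are illustrative assumptions, not the authors' code. The last line shows where "1.58 bits" comes from: a three-valued weight carries log2(3) bits of information.

```python
import math
import numpy as np

def absmean_quantize(w, eps=1e-5):
    """Hypothetical helper: map a float weight matrix to ternary values.

    gamma is the mean absolute weight; eps (assumed value) avoids
    division by zero for an all-zero matrix.
    """
    gamma = np.mean(np.abs(w))
    # Scale, round to the nearest integer, then clip into [-1, 1].
    w_ternary = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_ternary.astype(np.int8), gamma

# Toy example with random weights (illustrative data only).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
w_q, gamma = absmean_quantize(w)

# Information content of one ternary weight, in bits.
bits_per_weight = math.log2(3)
print(w_q)
print(f"gamma = {gamma:.4f}, bits per weight = {bits_per_weight:.2f}")
```

Because every entry of `w_q` is -1, 0, or 1, a matrix-vector product with it needs only additions and subtractions, which is the source of the energy savings the abstract refers to.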
