The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei

TL;DR

This paper introduces BitNet b1.58, a 1.58-bit LLM with ternary weights {-1, 0, 1} that matches FP16/BF16 Transformer performance at the same model size and number of training tokens, while substantially reducing latency, memory footprint, and energy use and increasing throughput. The approach constrains weights with absmean quantization, scales activations per token, adopts LLaMA-like components for compatibility with open-source tooling, and replaces standard linear layers with BitLinear. Empirically, BitNet b1.58 reaches perplexity parity with FP16 baselines from roughly 3B parameters onward and delivers substantial efficiency gains: up to 4.1x speedups at 70B, 71x matrix-multiplication energy savings, and support for up to 11x larger batch sizes, which in turn raises throughput. A model trained on 2T tokens also generalizes well, outperforming StableLM-3B across multiple end tasks. The work outlines a path toward dedicated hardware and future extensions (e.g., 1-bit MoEs, long-sequence handling, edge/mobile deployment) that could enable cost-effective, scalable LLMs and new computing paradigms.
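The two quantization steps named above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the common reading of absmean quantization (divide by the mean absolute weight, round, clip to [-1, 1]) and of per-token activation scaling (divide each token's activations by its largest absolute value and map to a signed 8-bit range). The function names, the `eps` term, and the 8-bit default are illustrative assumptions.

```python
def absmean_quantize(weights, eps=1e-8):
    """Map a 2-D list of float weights to ternary values {-1, 0, 1}.

    Assumed formulation: scale by the mean absolute weight, round to the
    nearest integer, then clip to [-1, 1]. Returns (ternary, scale).
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat)  # mean absolute value of the whole matrix
    ternary = [
        [max(-1, min(1, round(w / (scale + eps)))) for w in row]
        for row in weights
    ]
    return ternary, scale


def absmax_quantize_token(activations, bits=8, eps=1e-8):
    """Scale one token's activations into a symmetric signed integer range.

    Assumed formulation: divide by the token's largest absolute value and
    map to [-q_max, q_max], with q_max = 127 for 8 bits.
    """
    q_max = 2 ** (bits - 1) - 1
    scale = max(abs(a) for a in activations)  # one scale per token
    quantized = [
        max(-q_max, min(q_max, round(a * q_max / (scale + eps))))
        for a in activations
    ]
    return quantized, scale


# Hypothetical toy values, chosen only to exercise the functions.
W = [[0.9, -0.05, 0.4], [-1.2, 0.02, -0.3]]
W_q, gamma = absmean_quantize(W)

x = [0.3, -1.0, 0.0]
x_q, x_scale = absmax_quantize_token(x)
```

In this sketch, small weights collapse to 0 while larger ones keep only their sign, so the returned scales are what a layer would need to map integer results back to real-valued outputs.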

Abstract

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
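
The "1.58" in the name is presumably the information content of a ternary value, which this excerpt does not spell out. Under that assumption, a short calculation gives the bits per weight and an idealized upper bound on weight-storage reduction relative to FP16. The bound ignores activations, embeddings, and the overhead of packing ternary values into bytes.

```latex
\[
\log_2 3 \approx 1.585 \ \text{bits per weight},
\qquad
\frac{16}{\log_2 3} \approx 10.1 .
\]
```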

Paper Structure

This paper contains 10 sections, 3 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: 1-bit LLMs (e.g., BitNet b1.58) provide a Pareto solution to reduce inference cost (latency, throughput, and energy) of LLMs while maintaining model performance. The new computation paradigm of BitNet b1.58 calls for actions to design new hardware optimized for 1-bit LLMs.
  • Figure 2: Decoding latency (left) and memory consumption (right) of BitNet b1.58 as the model size varies.
  • Figure 3: Energy consumption of BitNet b1.58 compared to LLaMA LLM at 7nm process nodes. Left: the components of arithmetic-operation energy. Right: the end-to-end energy cost across different model sizes.