Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei

TL;DR

Bitnet.cpp targets efficient, lossless edge inference for ternary LLMs, addressing non-integer bits per weight and memory-alignment constraints by introducing two core solutions: an element-wise LUT-based mpGEMM (TL) and a lossless MAD-based kernel (I2_S). It reframes edge mpGEMM into a practical taxonomy and demonstrates up to 6.25x speedups over full-precision baselines while preserving BitNet b1.58 accuracy, with TL2_0 delivering strong LUT-based performance against MAD baselines. The work also extends TL to ELUT in the appendix, analyzes compute-memory trade-offs, and provides hardware-aware insights to guide future edge-accelerator designs. Collectively, Bitnet.cpp advances practical deployment of low-bpw LLMs on edge devices, balancing speed, accuracy, and memory considerations through novel kernel designs.

Abstract

The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper, offering a sophisticated solution for the efficient and practical deployment of edge LLMs.

Paper Structure

This paper contains 34 sections, 6 equations, 11 figures, 7 tables, and 4 algorithms.

Figures (11)

  • Figure 1: A comparison of end-to-end inference speeds on a 100B ternary LLM. $(bx)$ represents the bits per weight, where $x$ is the specific value. "N/A" indicates that the tested CPU cannot host the specified model size with the given kernel.
  • Figure 2: An example to demonstrate lossless inference for BitNet b1.58 with Bitnet.cpp.
  • Figure 3: A taxonomy of mpGEMM solutions for ternary LLMs on edge devices. TL and I2_S are integrated in Bitnet.cpp, while QX and TQX are integrated in llama.cpp.
  • Figure 4: A simple example of how different methods complete mpGEMM when $K=4$: (1) is the MAD-based solution, where the result is obtained via the dot product; (2) is the bit-wise LUT-based solution, where the weights are split into different bit indices, and the result is obtained by performing a lookup in the LUT, followed by bit-shifting and accumulation; (3) is the element-wise LUT-based solution, where all possible values of the weights are enumerated to obtain the index, and the result is obtained by performing a lookup in the LUT, followed by accumulation. $A_x$ refers to the $x$-th bit in weight $A$. In (2), $g = 4$ and $b = 2$; whereas in (3) $g = 2$ and $C = 3$.
  • Figure 5: The TL2 design uses signed-unsigned weight splitting. First, a 4-bit index weight is used to look up the table and obtain the unsigned result. Then, the corresponding 1-bit sign weight is applied to perform the sign operation on the unsigned result, yielding the final output.
  • ...and 6 more figures
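
The three approaches contrasted in the Figure 4 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Bitnet.cpp kernels: the function names are hypothetical, activations are plain integers, and the bit-wise variant assumes each ternary weight $w$ is stored as the 2-bit code $w + 1$, an encoding chosen here for simplicity. The parameters follow the caption: $K = 4$, with $g = 4$, $b = 2$ for the bit-wise LUT and $g = 2$, $C = 3$ for the element-wise LUT.

```python
from itertools import product


def mad_dot(acts, weights):
    """(1) MAD-based: a plain multiply-add dot product."""
    return sum(a * w for a, w in zip(acts, weights))


def bitwise_lut_dot(acts, weights, g=4, bits=2):
    """(2) Bit-wise LUT: one lookup per bit plane, then shift and accumulate.

    Assumed encoding (not taken from the source): u = w + 1, stored in 2 bits.
    """
    total = 0
    for start in range(0, len(acts), g):
        group = acts[start:start + g]
        # LUT over all 2**g one-bit patterns: sum of the selected activations.
        table = [
            sum(a for j, a in enumerate(group) if (pattern >> j) & 1)
            for pattern in range(1 << len(group))
        ]
        codes = [w + 1 for w in weights[start:start + g]]
        for b in range(bits):
            # Gather bit b of every weight code into one LUT index.
            pattern = sum(((u >> b) & 1) << j for j, u in enumerate(codes))
            total += table[pattern] * (1 << b)
        # Undo the +1 offset introduced by the unsigned encoding.
        total -= sum(group)
    return total


def elementwise_lut_dot(acts, weights, g=2):
    """(3) Element-wise LUT: enumerate all C**g weight tuples (C = 3)."""
    total = 0
    for start in range(0, len(acts), g):
        group = acts[start:start + g]
        # LUT entry for every possible ternary weight tuple of this group.
        table = [
            sum(a * w for a, w in zip(group, combo))
            for combo in product((-1, 0, 1), repeat=len(group))
        ]
        # Base-3 index, in the same order that product() enumerates tuples.
        index = 0
        for w in weights[start:start + g]:
            index = index * 3 + (w + 1)
        total += table[index]
    return total


acts = [3, -1, 4, 2]       # K = 4 integer activations (made-up values)
weights = [1, -1, 0, 1]    # ternary weights in {-1, 0, 1}
results = (
    mad_dot(acts, weights),
    bitwise_lut_dot(acts, weights),
    elementwise_lut_dot(acts, weights),
)
print(results)
```

All three functions compute the same dot product. The difference is in what gets stored and indexed: the bit-wise table has $2^g$ entries and needs one lookup per bit plane, while the element-wise table has $C^g$ entries and needs a single lookup per group with no bit-shifting.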