1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs

Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei

TL;DR

This work introduces bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit LLMs, and develops a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs.

Abstract

Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that bitnet.cpp achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at https://github.com/microsoft/BitNet.
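As a rough illustration of what "ternary" means for a kernel, the sketch below stores weights from {-1, 0, 1} as 2-bit codes and computes a dot product using only additions and subtractions. It is a toy under stated assumptions, not the bitnet.cpp kernels: the function names `pack_ternary` and `ternary_dot` and the four-codes-per-byte layout are hypothetical choices made for this example. The "1.58" in the model name corresponds to log2(3) ≈ 1.585, the number of bits needed to encode one of three values.

```python
import math

# Information content of a single ternary weight: log2(3) is about 1.585 bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def pack_ternary(weights):
    """Pack weights from {-1, 0, 1} into bytes, four 2-bit codes per byte.

    Hypothetical layout for illustration only: code = weight + 1,
    with the first weight of each group in the lowest two bits.
    """
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w!r}")
            byte |= (w + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)


def ternary_dot(packed, n, activations):
    """Dot product of n packed ternary weights with integer activations.

    No multiplications are needed: a weight of +1 adds the activation,
    -1 subtracts it, and 0 skips it.
    """
    acc = 0
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 2:      # weight +1
            acc += activations[i]
        elif code == 0:    # weight -1
            acc -= activations[i]
        # code == 1 is weight 0: contributes nothing
    return acc


weights = [1, -1, 0, 1, -1, 0]
activations = [3, 5, 7, -2, 4, 9]
packed = pack_ternary(weights)
result = ternary_dot(packed, len(weights), activations)
reference = sum(w * a for w, a in zip(weights, activations))
```

With integer activations, `result` equals `reference` exactly in this toy case, since the packed form only changes how the weights are stored, not their values.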

Paper Structure

This paper contains 8 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Comparison of inference speed and energy consumption for various BitNet b1.58 model sizes on an Apple M2 Ultra (ARM CPU) using llama.cpp (fp16) versus bitnet.cpp (ternary kernels). The results demonstrate that bitnet.cpp can achieve human reading speed, even for a 100B model on a single CPU. Notably, bitnet.cpp significantly reduces energy consumption across different model sizes.