OneBit: Towards Extremely Low-bit Large Language Models

Yuzhuang Xu; Xu Han; Zonghan Yang; Shuo Wang; Qingfu Zhu; Zhiyuan Liu; Weidong Liu; Wanxiang Che

TL;DR

This work tackles the challenge of deploying LLMs under extreme quantization by introducing OneBit, a 1-bit weight framework that represents each weight matrix as a sign matrix combined with two FP16 value vectors. The authors develop Sign-Value-Independent Decomposition (SVID) to initialize 1-bit models from full-precision weights and use quantization-aware knowledge distillation to transfer the teacher's capabilities to the 1-bit student. Empirical results across multiple model families (OPT, LLaMA, LLaMA2) show that OneBit in the W1A16 setting (1-bit weights, 16-bit activations) substantially narrows the gap to FP16, often outperforms post-training quantization (PTQ) baselines, and trains stably. The work demonstrates substantial memory savings and potential computational gains, with practical implications for deploying large models on resource-constrained hardware. It also provides detailed ablations to guide future refinements in extremely low-bit quantization and knowledge transfer for LLMs.

Abstract

Model quantization uses low bit-width values to represent the weight matrices of existing models, which is a promising approach to reducing both the storage and computational overheads of deploying highly anticipated LLMs. However, current quantization methods suffer severe performance degradation when the bit-width is reduced to extremely low values, and thus focus on using 4-bit or 8-bit values to quantize models. This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for extremely low bit-width deployment of LLMs. To this end, we introduce a 1-bit model compression framework named OneBit, which includes a novel 1-bit parameter representation method to better quantize LLMs as well as an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the quantization framework. Extensive experimental results indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes when using only 1-bit weight matrices.
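The decomposition-based initialization mentioned above can be illustrated with a small NumPy sketch. It is only a hedged illustration, not the authors' implementation: it assumes the full-precision weight is split into its sign matrix and a rank-1 approximation of its absolute values, and it picks SVD for the rank-1 step (other factorizations could be used instead). The function name `svid` and the toy matrix sizes are hypothetical.

```python
import numpy as np

def svid(W):
    """Sketch: split W into a +-1 sign matrix and two value vectors (a, b)
    such that W is approximated by sign * outer(a, b).
    Assumption: the value part is the best rank-1 approximation of |W| (via SVD)."""
    sign = np.where(W >= 0, 1.0, -1.0)  # zeros are mapped to +1
    U, s, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    # The leading singular vectors of a non-negative matrix can be taken
    # non-negative; abs() removes the sign ambiguity returned by the SVD.
    a = np.sqrt(s[0]) * np.abs(U[:, 0])   # one value per output row
    b = np.sqrt(s[0]) * np.abs(Vt[0, :])  # one value per input column
    return sign, a, b

# Toy example with a random "full-precision" weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
sign, a, b = svid(W)

W_hat = sign * np.outer(a, b)
err_svid = np.linalg.norm(W - W_hat)
# Baseline for comparison: sign matrix scaled by a single scalar.
err_scalar = np.linalg.norm(W - sign * np.abs(W).mean())
print(f"rank-1 value vectors: {err_svid:.4f}  single scalar: {err_scalar:.4f}")
```

Because a constant matrix is itself rank-1, the rank-1 value part can never reconstruct `W` worse (in Frobenius norm) than a single scalar scale, which is one way to see why two value vectors are a reasonable starting point for the 1-bit model.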

Paper Structure (45 sections, 23 equations, 6 figures, 6 tables)

Introduction
Related Work
Large Language Model Compression
Large Language Model Quantization
Methodology
Background
1-bit Linear Layer Architecture
Sign-Value-Independent Decomposition
Proposition 1
Proposition 2
Knowledge Transfer
Experiments
Settings
Data
Training Details
...and 30 more sections

Figures (6)

Figure 1: The perplexity (lower scores mean better performance) of existing widely-used low-bit quantization methods on LLaMA-7B, reported on Wikitext2. All the examined previous approaches suffer from significant performance degradation when quantizing models to 2-bit values. Our 1-bit quantization method can outperform these 2-bit baselines.
Figure 2: The main idea of our method OneBit. The left is the original FP16 Linear Layer, in which both the activation $\mathbf{X}$ and the weight matrix $\mathbf{W}$ are in FP16 format. The right is our proposed architecture. Only value vectors $\mathbf{g}$ and $\mathbf{h}$ are in FP16 format, and the weight matrix consists of $\pm 1$ instead, which can be represented in INT1.
Figure 3: Comparison of model capabilities and compressive degree.
Figure 4: Tradeoff between size and PPL.
Figure 5: Training process of OneBit-7B.
...and 1 more figures
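The architecture described in the Figure 2 caption can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the layer scales the input by $\mathbf{g}$, multiplies by the $\pm 1$ matrix, and scales the output by $\mathbf{h}$. The names `onebit_linear` and `avg_bits_per_weight` are hypothetical, and float64 arrays stand in for FP16 and INT1 storage.

```python
import numpy as np

def onebit_linear(X, W_sign, g, h):
    """Sketch of a 1-bit linear layer (assumed form).
    X: (batch, in_features), W_sign: (out_features, in_features) with +-1 entries,
    g: (in_features,) input-side values, h: (out_features,) output-side values."""
    return ((X * g) @ W_sign.T) * h

def avg_bits_per_weight(out_features, in_features, vec_bits=16):
    """Average storage per weight: 1 bit per sign plus two FP16 vectors."""
    total_bits = out_features * in_features + vec_bits * (out_features + in_features)
    return total_bits / (out_features * in_features)

rng = np.random.default_rng(0)
out_features, in_features = 6, 4
W_sign = rng.choice([-1.0, 1.0], size=(out_features, in_features))
g = rng.random(in_features)
h = rng.random(out_features)
X = rng.normal(size=(3, in_features))

Y = onebit_linear(X, W_sign, g, h)

# Same result as a dense layer whose weight is W_sign * outer(h, g),
# but the dense weight never has to be materialized.
Y_dense = X @ (W_sign * np.outer(h, g)).T
print(np.allclose(Y, Y_dense))
print(avg_bits_per_weight(4096, 4096))
```

Under these assumptions, the two value vectors add only a small overhead on top of the 1 bit per sign entry, and the overhead shrinks as the matrix dimensions grow.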
