OneBit: Towards Extremely Low-bit Large Language Models
Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che
TL;DR
This work tackles the challenge of deploying LLMs under extreme quantization by introducing OneBit, a 1-bit weight framework that represents each weight matrix as a sign matrix combined with two FP16 value vectors. The authors develop Sign-Value-Independent Decomposition (SVID) to initialize 1-bit models from full-precision weights and use quantization-aware knowledge distillation to transfer the teacher's capabilities to the 1-bit student. Empirical results across multiple model families (OPT, LLaMA, LLaMA2) show that OneBit in the W1A16 setting (1-bit weights, 16-bit activations) significantly narrows the gap to FP16, often outperforming PTQ baselines while training stably. The work demonstrates substantial memory savings and potential computational gains, with practical implications for deploying large models on resource-constrained hardware. It also provides detailed ablations to guide future refinements in extremely low-bit quantization and knowledge transfer for LLMs.
Abstract
Model quantization uses low bit-width values to represent the weight matrices of existing models, which is a promising approach to reducing both the storage and computational overheads of deploying highly anticipated LLMs. However, current quantization methods suffer severe performance degradation when the bit-width is extremely reduced, and thus focus on using 4-bit or 8-bit values to quantize models. This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for extremely low bit-width deployment of LLMs. To this end, we introduce a 1-bit model compression framework named OneBit, which includes a novel 1-bit parameter representation method to better quantize LLMs, as well as an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the quantization framework. Extensive experimental results indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes when using only 1-bit weight matrices.
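The TL;DR describes each weight matrix as a sign matrix combined with two FP16 value vectors, initialized from the full-precision weights by a matrix decomposition. The NumPy sketch below illustrates one way to read that description. It assumes, which the text above does not state, that the two vectors come from a rank-1 SVD approximation of |W|, so that W ≈ sign(W) ⊙ (a bᵀ). The function names `svid_init`, `reconstruct`, and `onebit_linear` are hypothetical, and the signs are stored as int8 rather than packed bits to keep the example short.

```python
import numpy as np


def svid_init(W):
    """Split W into a +/-1 sign matrix and two FP16 value vectors.

    Assumption (not stated in the summary): the value vectors are taken
    from the rank-1 SVD approximation of |W|.
    """
    sign = np.where(W >= 0, 1, -1).astype(np.int8)
    U, S, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    # u v^T is unchanged if both vectors flip sign, so abs() only fixes
    # the arbitrary sign convention returned by the SVD routine.
    a = np.abs(U[:, 0]) * np.sqrt(S[0])   # one value per output row
    b = np.abs(Vt[0, :]) * np.sqrt(S[0])  # one value per input column
    return sign, a.astype(np.float16), b.astype(np.float16)


def reconstruct(sign, a, b):
    """Dense approximation of W: sign * outer(a, b)."""
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    return sign.astype(np.float32) * np.outer(a32, b32)


def onebit_linear(x, sign, a, b):
    """Compute x @ W_hat.T without materializing W_hat.

    Scale the inputs by b, multiply by the sign matrix, then scale
    the outputs by a.
    """
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    return ((x * b32) @ sign.T.astype(np.float32)) * a32


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 16)).astype(np.float32)  # (out, in)
    x = rng.standard_normal((4, 16)).astype(np.float32)  # (batch, in)

    sign, a, b = svid_init(W)
    W_hat = reconstruct(sign, a, b)
    rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
    print("relative reconstruction error:", rel_err)
    print("output shape:", onebit_linear(x, sign, a, b).shape)
```

In this reading, the only full-size tensor is the sign matrix, while the two vectors add one FP16 value per row and per column. The initialization alone is lossy; the abstract attributes the recovered accuracy to training on top of it.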
