DB-LLM: Accurate Dual-Binarization for Efficient LLMs

Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, Dacheng Tao

TL;DR

DB-LLM introduces Flexible Dual Binarization (FDB) and Deviation-Aware Distillation (DAD) to enable accurate $2$-bit weight quantization for large language models. By splitting 2-bit weights into two independent 1-bit binaries with learnable scales, FDB preserves representation power while leveraging bitwise efficiency; DAD mitigates macro-level prediction distortion by emphasizing ambiguous samples through entropy-based reweighting. Empirical results on LLaMA-1/2 show DB-LLM achieves state-of-the-art perplexities and zero-shot accuracies at ultra-low bit-widths, along with major reductions in FLOPs and storage (e.g., up to $3.7\times$ storage savings and up to a $\sim14\times$ FLOP reduction). The work demonstrates practical ultra-low-bit quantization viability and provides a data-free calibration workflow, offering a scalable path toward efficient deployment of large language models.

Abstract

Large language models (LLMs) have significantly advanced the field of natural language processing, but their expensive memory and computation consumption impedes practical deployment. Quantization has emerged as one of the most effective methods for improving the computational efficiency of LLMs. However, existing ultra-low-bit quantization always causes severe accuracy drops. In this paper, we empirically reveal the micro and macro characteristics of ultra-low-bit quantization and present a novel Dual-Binarization method for LLMs, namely DB-LLM. At the micro level, we take both the accuracy advantage of 2-bit width and the efficiency advantage of binarization into account, introducing Flexible Dual Binarization (FDB). By splitting 2-bit quantized weights into two independent sets of binaries, FDB ensures the accuracy of representations and introduces flexibility, utilizing the efficient bitwise operations of binarization while retaining the inherent high sparsity of ultra-low-bit quantization. At the macro level, we find that distortion exists in the predictions of the LLM after quantization, specifically deviations related to the ambiguity of samples. We propose the Deviation-Aware Distillation (DAD) method, enabling the model to focus differently on various samples. Comprehensive experiments show that our DB-LLM not only significantly surpasses the current state-of-the-art (SoTA) in ultra-low-bit quantization (e.g., perplexity decreased from 9.64 to 7.23), but also achieves an additional 20% reduction in computational consumption compared to the SoTA method under the same bit-width. Our code will be released soon.
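
The entropy-based reweighting idea behind DAD can be illustrated with a minimal sketch. This is not the paper's loss, whose exact form is not given in this summary; it only assumes that "ambiguous" samples are those where the full-precision teacher's prediction has high entropy, and that each sample's distillation term is weighted by that normalized entropy. The function names `softmax` and `entropy_weighted_kd` are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_weighted_kd(teacher_logits, student_logits, eps=1e-12):
    """Illustrative entropy-weighted distillation loss (assumed form)."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    # Per-sample cross-entropy between teacher and student predictions.
    ce = -(p_t * np.log(p_s + eps)).sum(axis=-1)
    # Teacher entropy, normalized to [0, 1] by log(number of classes).
    ent = -(p_t * np.log(p_t + eps)).sum(axis=-1)
    weights = ent / np.log(p_t.shape[-1])
    # Weighted mean: ambiguous (high-entropy) samples count more.
    loss = (weights * ce).sum() / (weights.sum() + eps)
    return loss, weights

# Sample 0 is confident, sample 1 is maximally ambiguous.
teacher = np.array([[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
student = np.array([[5.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
loss, weights = entropy_weighted_kd(teacher, student)
```

In this toy batch the uniform teacher row receives a weight close to 1 and the confident row a weight close to 0, so the loss is dominated by the ambiguous sample.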

Paper Structure (22 sections, 9 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: The perplexity on WikiText2 for LLaMA family models. 2-bit DB-LLM is close to FP results and surpasses 3-bit AWQ by a large margin.
  • Figure 2: Illustration of our proposed DB-LLM. The Flexible Dual Binarization (FDB) approach, employing two independent 1-bit sparse weights for simultaneous matrix multiplication, significantly enhances the flexibility in weight representation. Deviation-Aware Distillation (DAD) steers the quantized model towards a heightened focus on ambiguous samples, enhancing its performance by refining quantization parameters.
  • Figure 3: Distributions of the first output projection matrix (LLaMA-1-7B). Colored levels mark the optimal solutions found by grid search, which minimize the proxy quantization error (MSE loss of outputs) for binarization, 2-bit quantization, and FDB. Because the weights are approximately normally distributed and binarization has no level representing 0, binarization pulls its two levels closer to 0. Its expression span is less than half that of 2-bit quantization, which hinders precise representation of the many significant weights with larger values.
  • Figure 4: Loss landscape of a single quantized linear layer based on binarization (a), 2-bit quantization (b), and our FDB (c). For (a), (b), and (c), we perturb the training parameters of the single layer and calculate the MSE loss, comparing the outputs of the quantized layer with those of the full-precision model. (d) highlights the disparity among the three surfaces by juxtaposing them within a single coordinate framework.
  • Figure 5: The splitting procedure of FDB. The dual separate 1-bit weight can be computed by comparing the central values.
  • ...and 2 more figures
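
The dual-binary representation described in the Figure 2 and Figure 5 captions can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes the two binaries take values in {0, 1} (consistent with the "sparse" description), uses scalar scales `alpha1` and `alpha2` where the method learns them, and assigns each weight to the nearest of the four representable levels {0, alpha1, alpha2, alpha1 + alpha2}. The names `dual_binarize` and `reconstruct` are hypothetical.

```python
import numpy as np

def dual_binarize(w, alpha1, alpha2):
    """Map each weight to its nearest level and return two {0,1} masks."""
    # All (b1, b2) codes and the level each code reconstructs.
    codes = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=np.int8)
    levels = codes[:, 0] * alpha1 + codes[:, 1] * alpha2
    # Nearest-level assignment, equivalent to thresholding at midpoints.
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return codes[idx, 0], codes[idx, 1]

def reconstruct(b1, b2, alpha1, alpha2):
    # Dequantized weight as a sum of two scaled binary tensors.
    return alpha1 * b1 + alpha2 * b2

# Assumed example scales: levels are {0, 1.0, -0.5, 0.5}.
alpha1, alpha2 = 1.0, -0.5
w = np.array([[-0.9, -0.4], [0.05, 0.6]])
b1, b2 = dual_binarize(w, alpha1, alpha2)
w_hat = reconstruct(b1, b2, alpha1, alpha2)

# The matmul splits into two binary matmuls, one per scale.
x = np.array([[1.0, 2.0]])
y_split = alpha1 * (x @ b1) + alpha2 * (x @ b2)
y_dense = x @ w_hat
```

Because the product with the reconstructed weight equals the scaled sum of two products with binary matrices, each half can in principle use binary-friendly kernels, which is the efficiency argument the captions make.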