DBellQuant: Breaking the Bell with Double-Bell Transformation for LLMs Post Training Binarization

Zijian Ye, Wei Huang, Yifei Yu, Tianhe Ren, Zhongrui Wang, Xiaojuan Qi

TL;DR

DBellQuant tackles post-training quantization of large language models by introducing a per-channel Learnable Transformation for Dual-Bell (LTDB) that reshapes weight distributions from unimodal to bimodal, enabling nearly 1-bit weight binarization. The inverse of the transformation is applied to the activations; together with activation-aware initialization, this smooths activations and suppresses outliers, which supports low-bit activation quantization. Training uses two targeted losses, DTMD and DTNP, plus an early-stopping mechanism, so that the weights converge reliably to a dual-bell distribution while the layer computation stays equivalent. Empirical results across multiple LLM families show state-of-the-art performance under aggressive quantization: nearly 1-bit weights with 6-bit or even 4-bit activations, substantial model-size reductions, and practical speedups suitable for edge and real-world deployments. The approach makes it more feasible to deploy large-scale models in resource-constrained environments without retraining.
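
To see why a bimodal weight distribution is easier to binarize than a unimodal one, the following minimal sketch compares the relative squared error of sign binarization on synthetic data. It assumes a plain `alpha * sign(w)` binarizer with `alpha = mean(|w|)`, which is not necessarily the binarizer used in the paper; the sample size, the mixture means of ±1, and the standard deviation of 0.2 are likewise illustrative choices.

```python
import random


def binarize(weights):
    """Approximate each weight by alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


def relative_error(weights):
    """Squared binarization error divided by the squared norm of the weights."""
    approx = binarize(weights)
    num = sum((w - b) ** 2 for w, b in zip(weights, approx))
    den = sum(w * w for w in weights)
    return num / den


rng = random.Random(0)
n = 20000

# Single-bell: one zero-mean Gaussian (illustrative, not taken from a real model).
single_bell = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Dual-bell: equal mixture of two narrow Gaussians centred at -1 and +1.
dual_bell = [rng.gauss(rng.choice((-1.0, 1.0)), 0.2) for _ in range(n)]

err_single = relative_error(single_bell)
err_dual = relative_error(dual_bell)
print(f"single-bell relative error: {err_single:.3f}")
print(f"dual-bell relative error:   {err_dual:.3f}")
```

Under these assumptions the dual-bell sample has a much smaller relative error, because most of its mass already sits close to the two values `±alpha` that the binarizer can represent.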

Abstract

Large language models (LLMs) demonstrate remarkable performance but face substantial computational and memory challenges that limit their practical deployment. Quantization has emerged as a promising solution; however, its effectiveness is often limited by quantization errors arising from weight distributions that are not quantization-friendly and from the presence of activation outliers. To address these challenges, we introduce DBellQuant, an innovative post-training quantization (PTQ) framework that achieves nearly 1-bit weight compression and 6-bit activation quantization with minimal performance degradation. DBellQuant uses the Learnable Transformation for Dual-Bell (LTDB) algorithm, which transforms single-bell weight distributions into dual-bell forms to reduce binarization errors and applies inverse transformations to smooth activations. DBellQuant sets a new state of the art by preserving superior model performance under aggressive weight and activation quantization. For example, on the Wikitext2 dataset, DBellQuant achieves a perplexity of 14.39 on LLaMA2-13B with 6-bit activation quantization, significantly outperforming BiLLM's 21.35 without activation quantization, underscoring its potential in compressing LLMs for real-world applications.
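
The pairing of a weight transformation with its inverse on the activations can be illustrated with a small sketch. It assumes the transformation acts as a positive per-input-channel scale `t` (a diagonal matrix, in the style of activation-smoothing methods); the actual form of the learned transformation in LTDB may differ, and all numbers below are made up for illustration.

```python
import math


def linear(x, weight):
    """Compute y = W x for one input vector x and a weight matrix given as rows."""
    return [sum(xj * wij for xj, wij in zip(x, row)) for row in weight]


# Toy activation vector; channel 1 is an outlier.
x = [0.5, -40.0, 1.5]

# Toy weight matrix: 2 output channels, 3 input channels.
weight = [
    [0.2, -0.1, 0.4],
    [-0.3, 0.05, 0.1],
]

# Hypothetical per-input-channel scales (a learned transformation would supply these).
t = [1.0, 8.0, 1.0]

# Apply the transformation to the weights and its inverse to the activations.
weight_t = [[wij * tj for wij, tj in zip(row, t)] for row in weight]
x_t = [xj / tj for xj, tj in zip(x, t)]

y_ref = linear(x, weight)
y_new = linear(x_t, weight_t)

print("outputs match:", all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(y_ref, y_new)))
print("largest |activation| before:", max(abs(v) for v in x))
print("largest |activation| after: ", max(abs(v) for v in x_t))
```

The layer output is unchanged because each product `x[j] * W[i][j]` is multiplied and divided by the same `t[j]`, while the outlier channel in the activations shrinks by that factor. Here the scales are fixed by hand; in the paper's framing they are what the LTDB algorithm learns.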

Paper Structure

This paper contains 52 sections, 2 theorems, 43 equations, 23 figures, 25 tables, 1 algorithm.

Key Result

Theorem 1

Let $\boldsymbol{W} \in \mathbb{R}^{n \times m}$ be a weight matrix where each channel $\boldsymbol{w}_i$ (for $i \in \{1, 2, \dots, n\}$) is sampled from a single-bell Gaussian distribution $\boldsymbol{w}_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. There exists a learnable matrix $T \in \mathbb{R}^{m \dots}$ … where $\pi \in (0, 1)$ is the mixing coefficient, and $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$ are pa…

Figures (23)

  • Figure 1: Performance on the Wikitext2 dataset. With 8-bit activations, DBellQuant outperforms weight-only quantization methods.
  • Figure 1: Performance w/o LTDB Algorithm.
  • Figure 2: (a) Before applying DBellQuant, activations exhibit significant outliers, making quantization challenging, while the single-bell-shaped weight distribution hinders binarization. (b) After applying DBellQuant, activations are smoothed with substantially fewer outliers, facilitating easier quantization. Weight distribution is transformed to dual-bell form, which is more conducive to binarization.
  • Figure 2: DeepSeek-R1-Distill-Qwen-7B results.
  • Figure 3: DBellQuant Framework Overview: (a) The original weight distribution is single-bell. (b) Activation-aware initialization generates the initial transformation matrix. (c) The LTDB algorithm trains the transformation matrix iteratively, using the proposed Dual-Transformation Loss both as the training objective and as the termination criterion. (d) After transformation, the weight distribution is dual-bell.
  • ...and 18 more figures

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof