BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Junyu Chen; Jungang Li; Jing Xiong; Wenjie Wang; Qingyao Yang; He Xiao; Zhen Li; Taiqiang Wu; Mengzhao Chen; Zhen Peng; Chaofan Tao; Long Shi; Hongxia Yang; Ngai Wong

TL;DR

BPDQ targets the memory and bandwidth bottlenecks of LLM inference by relaxing the rigid, shape-invariant quantization grids used in optimization-based PTQ. It builds a variable quantization grid for each group from bit-planes and scalar coefficients, refines both within the Hessian-induced geometry, and stabilizes the result with delta corrections. Theoretical results show that the variable grid expands the feasible solution set for the $H$-metric projection. Experimentally, 2-bit BPDQ serves Qwen2.5-72B on a single RTX 3090 at 83.85% GSM8K accuracy (vs. 90.83% at 16-bit), with LUT-accelerated decoding. The approach preserves activation outliers and shows a favorable efficiency profile, making it a practical route to extreme model compression on consumer GPUs.

Abstract

Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: github.com/KingdalfGoodman/BPDQ.
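To make the shape-invariance argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 2-bit group whose four levels are written as `c0 + b1*c1 + b2*c2` with bits `b1, b2` in {0, 1}; the offset `c0`, the function names, and the toy weights are illustrative assumptions. It also scores error with a plain squared distance rather than the Hessian-weighted objective the paper optimizes.

```python
from itertools import product


def uniform_levels(lo, hi, bits=2):
    """Fixed grid: 2**bits equally spaced levels between lo and hi."""
    n = 2 ** bits
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def variable_levels(c0, c1, c2):
    """Variable grid (assumed form): one level per 2-bit pattern (b1, b2)."""
    return [c0 + b1 * c1 + b2 * c2 for b1, b2 in product((0, 1), repeat=2)]


def quantize(weights, levels):
    """Round each weight to its nearest level."""
    return [min(levels, key=lambda q: abs(w - q)) for w in weights]


def squared_error(weights, levels):
    """Plain squared error; a stand-in for the Hessian-weighted objective."""
    return sum((w - q) ** 2 for w, q in zip(weights, quantize(weights, levels)))


# Toy group with two tight clusters, which an evenly spaced grid fits poorly.
group = [0.0, 0.1, 1.0, 1.1]

fixed = uniform_levels(min(group), max(group))
err_fixed = squared_error(group, fixed)

# Coefficients chosen by hand for this toy group: levels 0.0, 0.1, 1.0, 1.1.
variable = variable_levels(0.0, 1.0, 0.1)
err_variable = squared_error(group, variable)

# The uniform grid is the special case c1 = 2 * step, c2 = step.
step = (max(group) - min(group)) / 3
as_special_case = sorted(variable_levels(min(group), 2 * step, step))

print(err_fixed, err_variable)
```

Because the uniform grid is recovered by one particular choice of coefficients, any error it achieves is also achievable on the variable grid, while other coefficient choices can do strictly better, as the toy group shows.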

Paper Structure (42 sections, 31 equations, 3 figures, 5 tables)

Introduction
Related Work
Methodology
Preliminaries
Optimization Objective.
Quantization Error Compensation.
Variable Grid Initialization
Bit-Plane Selection.
Scalar Coefficient Fitting.
Iteration under the Optimization Objective
Bit-plane Update.
Coefficient Refitting.
Delta Correction.
Experiments
Experimental Setup
...and 27 more sections

Figures (3)

Figure 1: (a) Fixed grids (Uniform/Non-Uniform) enforce shape invariance, where the relative spacing of quantization levels is shared across groups (scaled by $s$). BPDQ breaks this limitation by constructing a variable grid per group using bit-plane coefficients ($c_1$, $c_2$), expanding the feasible set. (b) Performance comparison of 2-bit quantized Qwen2.5-72B.
Figure 2: Overview of the 2-bit BPDQ quantization procedure.
Figure 3: LongBench performance comparison on Qwen2.5-7B.
