BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models
Junyu Chen, Jungang Li, Jing Xiong, Wenjie Wang, Qingyao Yang, He Xiao, Zhen Li, Taiqiang Wu, Mengzhao Chen, Zhen Peng, Chaofan Tao, Long Shi, Hongxia Yang, Ngai Wong
TL;DR
BPDQ targets the memory-footprint and memory-bandwidth bottlenecks of LLM inference by relaxing the rigid, shape-invariant quantization grids used in existing optimization-based PTQ. It builds a variable quantization grid from bit-planes and per-group scalar coefficients, refines them within the Hessian-induced geometry, and stabilizes the result with delta corrections. Theoretical results show that the variable grid expands the feasible solution set for the $H$-metric projection. Experimentally, 2-bit BPDQ serves Qwen2.5-72B on a single RTX 3090 at 83.85% GSM8K accuracy (vs. 90.83% at 16-bit), enabling deployment on consumer GPUs with LUT-accelerated decoding. The approach preserves activation outliers and shows favorable efficiency profiles, offering a practical path toward extreme model compression while leaving room for future hardware and fidelity improvements.
Abstract
Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: github.com/KingdalfGoodman/BPDQ.
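To make the notion of a variable grid concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a 2-bit group is reconstructed as an offset plus a sum of binary bit-planes weighted by per-group scalar coefficients; the `{0, 1}` bit encoding, the names `grid_levels` and `nearest_error`, the sample weights, and the hand-picked coefficients are all hypothetical. Under that assumption, the uniform UINT2 grid is the special case in which the second coefficient is exactly twice the first, so freeing the coefficients can only enlarge the set of reachable grids.

```python
from itertools import product


def grid_levels(c0, coeffs):
    """Return all values c0 + sum_i b_i * coeffs[i] for bits b_i in {0, 1}."""
    return sorted(
        c0 + sum(b * c for b, c in zip(bits, coeffs))
        for bits in product((0, 1), repeat=len(coeffs))
    )


def nearest_error(weights, levels):
    """Sum of squared errors when each weight rounds to its nearest level."""
    return sum(min((w - q) ** 2 for q in levels) for w in weights)


# Hypothetical weight group with unevenly clustered values.
group = [-0.90, -0.85, -0.10, 0.00, 0.05, 0.70]

# Uniform UINT2-style grid: four equally spaced levels over the group range.
# In the bit-plane view this is the constrained case coeffs = (s, 2s).
lo, hi = min(group), max(group)
s = (hi - lo) / 3
uniform = grid_levels(lo, [s, 2 * s])

# Variable grid: the two coefficients are no longer tied together.
# These values are hand-picked for illustration, not optimized.
variable = grid_levels(-0.875, [0.86, 1.575])

print("uniform levels :", uniform)
print("variable levels:", variable)
print("uniform error  :", nearest_error(group, uniform))
print("variable error :", nearest_error(group, variable))
```

The sketch uses plain round-to-nearest with a squared-error measure. The method described in the abstract instead refines bit-planes and coefficients using approximate second-order information and compensates errors progressively, which this example does not attempt to reproduce.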
