BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Junyu Chen, Jungang Li, Jing Xiong, Wenjie Wang, Qingyao Yang, He Xiao, Zhen Li, Taiqiang Wu, Mengzhao Chen, Zhen Peng, Chaofan Tao, Long Shi, Hongxia Yang, Ngai Wong

TL;DR

BPDQ tackles memory and bandwidth bottlenecks in LLM inference by relaxing the rigid, shape-invariant quantization grids used in traditional optimization-based PTQ. It introduces a variable-grid quantization built from bit-planes and per-group scalar coefficients, refined within the Hessian-induced geometry and stabilized by delta corrections. Theoretical results show that the variable grid expands the feasible solution set for the $H$-metric projection, while experiments demonstrate strong 2-bit fidelity for 72B models, enabling deployment on consumer GPUs with LUT-accelerated decoding. The approach maintains activation outliers and shows favorable efficiency profiles, offering a practical path toward extreme model compression with broad applicability and clear directions for future hardware and fidelity improvements.

Abstract

Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: github.com/KingdalfGoodman/BPDQ.
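As a rough check on the single-GPU claim, the weight storage can be estimated from the nominal $72 \times 10^{9}$ parameter count implied by the model name, counting only the bits per weight. This sketch ignores the per-group coefficients, any layers left unquantized, the KV cache, and activations, none of whose sizes are stated here, so it is a lower bound rather than the paper's own accounting.

$$
M_{16\text{-bit}} \approx \frac{72 \times 10^{9} \times 16}{8}\ \text{bytes} = 144\ \text{GB},
\qquad
M_{2\text{-bit}} \approx \frac{72 \times 10^{9} \times 2}{8}\ \text{bytes} = 18\ \text{GB}.
$$

An RTX 3090 carries 24 GB of memory, so under these assumptions the 2-bit weights leave roughly 6 GB for everything the estimate omits, whereas the 16-bit weights alone exceed the card's capacity about sixfold.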

Paper Structure

This paper contains 42 sections, 31 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: (a) Fixed grids (Uniform/Non-Uniform) enforce shape invariance, where the relative spacing of quantization levels is shared across groups (scaled by $s$). BPDQ breaks this limitation by constructing a variable grid per group using bit-plane coefficients ($c_1$, $c_2$), expanding the feasible set. (b) Performance comparison of 2-bit quantized Qwen2.5-72B.
  • Figure 2: Overview of the 2-bit BPDQ quantization procedure.
  • Figure 3: LongBench performance comparison on Qwen2.5-7B.
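
The contrast drawn in Figure 1(a) can be illustrated with a small numerical sketch. It assumes a 2-bit group whose levels are `c0 + c1*b1 + c2*b2` with bits `b1, b2` in {0, 1}; the offset `c0`, the {0, 1} bit convention, and all function names are assumptions made for illustration, not taken from the paper. The sketch uses plain nearest-level rounding with hand-picked coefficients and omits the Hessian-based refinement and error compensation that BPDQ describes.

```python
import numpy as np

def uniform_levels(s, z=0.0):
    """Fixed UINT2-style grid: four equally spaced levels z + s * {0, 1, 2, 3}."""
    return z + s * np.arange(4)

def variable_levels(c0, c1, c2):
    """Assumed bit-plane grid: c0 + c1*b1 + c2*b2 for (b1, b2) in {0, 1}^2."""
    bits = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
    return c0 + bits @ np.array([c1, c2])

def quantize(w, levels):
    """Round every weight to its nearest grid level."""
    idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

def mse(w, levels):
    """Mean squared rounding error of a weight group on a given grid."""
    return float(np.mean((w - quantize(w, levels)) ** 2))

# A toy weight group with two tight clusters (illustrative values only).
w = np.array([0.0, 0.02, 0.1, 0.12, 1.0, 1.03, 1.08, 1.1])

# Uniform grid fitted by min-max scaling: only the scale s and offset z are free.
z = w.min()
s = (w.max() - w.min()) / 3
grid_uniform = uniform_levels(s, z)

# The uniform grid is the special case c1 = s, c2 = 2s of the variable grid.
grid_as_special_case = variable_levels(z, s, 2 * s)

# A hand-picked variable grid whose spacing follows the two clusters.
grid_variable = variable_levels(0.0, 0.1, 1.0)

mse_uniform = mse(w, grid_uniform)
mse_variable = mse(w, grid_variable)
print("uniform levels :", grid_uniform, "MSE:", mse_uniform)
print("variable levels:", grid_variable, "MSE:", mse_variable)
```

Because every uniform grid is reachable by setting `c2 = 2 * c1`, the variable grid can never do worse than the uniform one under the same rounding rule, and on clustered groups like this toy example it can place levels where the weights actually sit.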