CBQ: Cross-Block Quantization for Large Language Models

Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang

TL;DR

CBQ is a cross-block, reconstruction-based PTQ framework for large language models that targets the strong inter-block dependencies arising in ultra-low-bit quantization. It combines a sliding-window cross-block dependency (CBD) mechanism, a LoRA-Rounding scheme with low-rank compensation, and a coarse-to-fine preprocessing (CFP) pipeline that detects and suppresses weight and activation outliers. The method jointly optimizes quantization parameters and compensation matrices within overlapping windows and reports state-of-the-art accuracy at W4A4, W4A8, and W2A16 across OPT and LLAMA models, including 4-bit quantization of LLAMA1-65B in 4.3 hours on a single GPU. The result is a practical, efficient PTQ route to deploying large language models at ultra-low bit-widths with minimal performance degradation, which in turn lowers inference cost. Ablations quantify the impact of CBD, LoRA-Rounding, and CFP on zero-shot tasks and generation perplexity.
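
The overlapping windows mentioned above can be pictured with a small sketch. It only enumerates which transformer blocks would be optimized together; the function name `sliding_windows`, the parameters `window_size` and `overlap`, and the example sizes are illustrative assumptions rather than values taken from the paper.

```python
def sliding_windows(num_blocks, window_size, overlap):
    """Return (start, end) block-index pairs, end exclusive, for overlapping windows.

    Hypothetical helper: consecutive windows share `overlap` blocks, so each
    shared block is revisited when the next window is optimized.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    stride = window_size - overlap
    windows = []
    start = 0
    while True:
        # The last window is shortened if it would run past the final block.
        end = min(start + window_size, num_blocks)
        windows.append((start, end))
        if end == num_blocks:
            break
        start += stride
    return windows


# Six blocks, two blocks per window, one block shared between neighbours.
print(sliding_windows(num_blocks=6, window_size=2, overlap=1))
# prints [(0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
```

With `overlap=0` the same helper degenerates to disjoint block groups, which is one way to see what the overlap adds: every block boundary falls inside some window instead of between two of them.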

Abstract

Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. However, existing PTQ methods only focus on handling the outliers within one layer or one block, which ignores the dependency of blocks and leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency using a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy for suppressing weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. These innovations enable CBQ to not only handle extreme outliers effectively but also improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model within only 4.3 hours on a single GPU, achieving a commendable tradeoff between performance and quantization efficiency.
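
The abstract names an adaptive LoRA-Rounding technique without spelling out its form. As a rough illustration only, the sketch below assumes the learned rounding offset is parameterized as a low-rank product `a1 @ a2` and squashed into [0, 1] with the rectified sigmoid used by AdaRound. The function names, the constants `zeta=1.1` and `gamma=-0.1`, the signed 4-bit grid, and the toy values are all assumptions, not details taken from this page.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rectified_sigmoid(v, zeta=1.1, gamma=-0.1):
    """AdaRound-style offset in [0, 1]; assumed here, not confirmed for CBQ."""
    # Numerically stable sigmoid for both signs of v.
    s = 1.0 / (1.0 + math.exp(-v)) if v >= 0 else math.exp(v) / (1.0 + math.exp(v))
    return min(1.0, max(0.0, s * (zeta - gamma) + gamma))


def lora_round_quantize(weights, step, a1, a2, n_bits=4):
    """Fake-quantize `weights` with a rounding offset built from low-rank factors.

    weights: d x k matrix, a1: d x r, a2: r x k, step: quantization step size.
    """
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    v = matmul(a1, a2)  # low-rank rounding logits, same shape as weights
    out = []
    for w_row, v_row in zip(weights, v):
        row = []
        for w, v_ij in zip(w_row, v_row):
            # floor plus a learned offset in [0, 1] decides round-down vs round-up
            q = math.floor(w / step) + rectified_sigmoid(v_ij)
            q = min(qmax, max(qmin, q))
            row.append(step * q)
        out.append(row)
    return out


weights = [[0.26, -0.31], [0.26, -0.31]]
a1 = [[10.0], [-10.0]]   # rank-1 factors chosen so row 0 rounds up, row 1 rounds down
a2 = [[1.0, 1.0]]
print(lora_round_quantize(weights, step=0.1, a1=a1, a2=a2))
```

In this toy setup the two rows hold identical weights but land on different grid points, because the rank-1 factors push the offset to 1 for the first row and to 0 for the second. The low-rank form stores `r * (d + k)` rounding parameters instead of `d * k`.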

Paper Structure (43 sections, 13 equations, 4 figures, 20 tables, 1 algorithm)

This paper contains 43 sections, 13 equations, 4 figures, 20 tables, 1 algorithm.

Figures (4)

  • Figure 1: (a) Visualization of the absolute values of the Hessian matrix for weights within a single layer of LLAMA-7B, (b) Hessian matrix visualization of the loss with respect to the scale across 32 layers of LLAMA-7B, and (c) the relationship between the average scale of the first two transformer blocks in LLAMA-7B and the corresponding loss. The term "scale" here refers to the quantization step size.
  • Figure 2: Workflow of the proposed CBQ. CBQ first applies coarse-to-fine preprocessing to handle weight and activation outliers, then uses a cross-block optimization strategy to learn quantization step sizes and adaptive weight-rounding matrices under supervision from the corresponding full-precision model. This sequential, block-wise procedure reduces accumulated error propagation by modeling cross-block dependencies.
  • Figure 3: Outliers pre-processing for weights and activations. The red dashed line indicates the truncation threshold for weight outliers, and the deep blue line represents the reserved subset. The light blue boxes depict activation outliers that undergo per-channel scaling.
  • Figure 4: Visualization of the absolute values of the Hessian matrix for LLaMA3-8B (left) and OPT-7B (right), highlighting intra-layer (top) and inter-layer (bottom) dependencies under 4-bit and 2-bit quantization. Consistent patterns can be observed across both models.
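
The two operations described in the Figure 3 caption, truncating weight outliers at a threshold and scaling activation outlier channels, can be sketched as follows. This is a minimal illustration under stated assumptions: the coarse-to-fine detection step is not reproduced, so the threshold, the outlier channel indices, and the scale factor are hand-picked, and folding the inverse scale into the matching weights is an assumed way of keeping the layer output unchanged. All function names are hypothetical.

```python
def truncate_weights(weights, threshold):
    """Clamp every weight into [-threshold, threshold]."""
    return [max(-threshold, min(threshold, w)) for w in weights]


def scale_outlier_channels(activations, weights, outlier_channels, scale):
    """Divide the chosen activation channels by `scale` and multiply the
    matching weights by `scale`, so that x . w is mathematically preserved."""
    x = list(activations)
    w = list(weights)
    for c in outlier_channels:
        x[c] = x[c] / scale
        w[c] = w[c] * scale
    return x, w


def dot(x, w):
    """Inner product of one activation vector with one weight column."""
    return sum(a * b for a, b in zip(x, w))


# Weight outliers beyond the (assumed) threshold of 1.0 are clipped.
print(truncate_weights([0.1, -3.0, 0.7, 2.5], threshold=1.0))

# Channel 1 carries an activation outlier; scaling shrinks its range.
x = [0.5, 40.0, -0.25]
w = [0.2, 0.01, -0.4]
x_s, w_s = scale_outlier_channels(x, w, outlier_channels=[1], scale=8.0)
print(x_s, w_s)
print(dot(x, w), dot(x_s, w_s))  # equal up to floating-point rounding
```

After scaling, the largest activation magnitude drops from 40 to 5 while the inner product is unchanged, which is why a single quantization step size covers the activations with less rounding error.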