TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation

Junhan Kim, Yeo Jeong Park, Seungwoo Son, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon

TL;DR

TurboBoA addresses the efficiency-accuracy trade-off in post-training quantization for large language models by extending BoA with three innovations: simultaneous out-channel quantization with a closed-form error-compensation rule, residual error correction for inputs distorted by earlier quantizations, and adaptive grid selection with coordinate-descent refinement. The approach remains backpropagation-free, preserving the computational advantages of GPTQ while capturing inter-channel dependencies within attention modules. Empirical results show more than a threefold speedup over BoA and consistent accuracy gains across weight-only and weight-activation quantization, with state-of-the-art performance when combined with outlier suppression methods. This work enables faster, more accurate quantization of billion-scale models, facilitating practical deployment on constrained hardware without gradient-based fine-tuning.

Abstract

The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained significant attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours. However, GPTQ's assumption of layer-wise independence leads to severe accuracy drops in low-bit regimes. Recently, BoA improved upon GPTQ by incorporating inter-layer dependencies within attention modules, but its reliance on sequential quantization across all out-channels makes it substantially less efficient. In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process. The proposed TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a closed-form error compensation rule, which reduces sequential bottlenecks and yields more than a three-fold speedup; (ii) a correction mechanism for errors propagated from preceding quantized layers; and (iii) adaptive grid computation with coordinate descent refinement to maintain alignment during iterative updates. Extensive experiments demonstrate that TurboBoA delivers substantial acceleration over BoA while consistently improving accuracy. When combined with outlier suppression techniques, it achieves state-of-the-art results in both weight-only and weight-activation quantization. The code will be available at https://github.com/SamsungLabs/TurboBoA.

Paper Structure

This paper contains 33 sections, 3 theorems, 27 equations, 1 figure, 9 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $\mathbf{W}$ be a matrix whose Hessian is given as $\mathbf{H} = \mathbf{H}_{in} \otimes \mathbf{H}_{out}$. Suppose the first $N$ out-channels of $\mathbf{W}$ have been quantized simultaneously and the other out-channels are updated to minimize the attention reconstruction error in the equation labeled eq:error corr, where $B = \{0, \ldots, N-1\}$ and $\mathbf{U}_{out} = \operatornamewithlimits{Chol}(\mathbf{H}_{ou
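
The Kronecker-structured Hessian in the proposition can be illustrated numerically. The sketch below is not taken from the paper: it assumes $\mathbf{W}$ has shape (out-channels, in-channels), uses column-major vectorization, and fills $\mathbf{H}_{in}$ and $\mathbf{H}_{out}$ with random symmetric positive-definite matrices of hypothetical sizes. Under those assumptions, it checks the standard identity that the quadratic form with $\mathbf{H}_{in} \otimes \mathbf{H}_{out}$ equals $\operatorname{tr}(\Delta\mathbf{W}^{\top} \mathbf{H}_{out} \Delta\mathbf{W} \mathbf{H}_{in})$, so the full Hessian never has to be formed explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 3, 4  # hypothetical layer sizes, chosen only for illustration


def random_spd(n: int) -> np.ndarray:
    """Return a random symmetric positive-definite n x n matrix."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)


H_in = random_spd(d_in)    # stands in for the in-channel factor
H_out = random_spd(d_out)  # stands in for the out-channel factor
dW = rng.standard_normal((d_out, d_in))  # a weight perturbation

# Explicit Kronecker Hessian of size (d_in * d_out) x (d_in * d_out).
H = np.kron(H_in, H_out)

# Column-major vectorization (assumption: the paper's convention is not shown here).
v = dW.ravel(order="F")

full = v @ H @ v                               # quadratic form with the full Hessian
factored = np.trace(dW.T @ H_out @ dW @ H_in)  # same value from the two small factors

print(np.isclose(full, factored))
```

The factored form only touches a `d_out x d_out` and a `d_in x d_in` matrix, which is the reason a Kronecker structure keeps such methods tractable for large layers.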

Figures (1)

  • Figure 1: Quantization orders in GPTQ, BoA, and the proposed TurboBoA. (a) GPTQ quantizes all out-channels jointly but without error correction. (b) BoA compensates for the quantization error but requires fully sequential processing across out-channels. (c) TurboBoA reduces the number of sequential operations by quantizing $N$ out-channels jointly while still applying error compensation.
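
The three orders described in Figure 1 differ in how many sequential steps they need along the out-channel dimension. The following is a minimal sketch of that bookkeeping only, not the authors' implementation: the helper name `out_channel_groups`, the width of 128 out-channels, and the group size $N = 16$ are assumptions chosen for illustration, and no quantization or error compensation is performed.

```python
from typing import List


def out_channel_groups(num_out_channels: int, group_size: int) -> List[range]:
    """Split out-channel indices into consecutive groups handled in one step."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [
        range(start, min(start + group_size, num_out_channels))
        for start in range(0, num_out_channels, group_size)
    ]


d_out = 128  # hypothetical number of out-channels
n = 16       # hypothetical group size N

# (a) all out-channels in one step (no compensation across out-channels)
joint_steps = len(out_channel_groups(d_out, d_out))
# (b) one out-channel per step (fully sequential)
sequential_steps = len(out_channel_groups(d_out, 1))
# (c) N out-channels per step
grouped_steps = len(out_channel_groups(d_out, n))

print(joint_steps, sequential_steps, grouped_steps)
```

With these assumed sizes, grouping cuts the number of sequential steps by a factor of $N$ relative to the fully sequential order. The measured wall-clock speedup depends on the cost of each step and is reported in the paper rather than implied by this count.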

Theorems & Definitions (3)

  • Proposition 3.1
  • Proposition 3.2
  • Proposition 3.3