BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference

Reena Elangovan, Charbel Sakr, Anand Raghunathan, Brucek Khailany

TL;DR

The paper addresses the challenge of accurate post-training quantization for both weights and activations in large language models at 4-bit precision. It introduces block clustered quantization (BCQ) and a locally optimal variant (LO-BCQ) that iteratively clusters operand blocks and designs per-cluster codebooks to minimize mean squared error, using a small set of universally calibrated codebooks that can be frozen across models. LO-BCQ achieves state-of-the-art trade-offs with <1% loss on downstream tasks and perplexity on Wikitext-103 using W4A4 (effective width around $4.5$–$4.625$ bits), while requiring no weight updates. The approach enables efficient hardware acceleration through small codebooks and per-block metadata, with demonstrated robustness across GPT3, Llama2, and Nemotron models and multiple evaluation tasks, marking a practical advance for PTQ in LLM deployment.

Abstract

Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-8-bits while maintaining activations at 8-bits or higher. Accurate sub-8-bit quantization for both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ) wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps of block clustering and codebook design to greedily minimize the quantization mean squared error. When weight and activation scalars are encoded to W4A4 format (with 0.5-bits of overhead for storing scaling factors and codebook selectors), we advance the current state-of-the-art by demonstrating <1% loss in inference accuracy across several LLMs and downstream tasks.
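The alternation described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`quantize`, `lloyd_max_step`, `lo_bcq_sketch`), the use of one Lloyd-Max centroid update per iteration as the codebook-design step, and the random initial codebooks are all choices made here for illustration. Per-block-array scaling, which the paper's format includes, is omitted.

```python
import numpy as np


def quantize(values, codebook):
    """Replace every scalar by its nearest codebook entry."""
    idx = np.abs(values[..., None] - codebook).argmin(axis=-1)
    return codebook[idx]


def lloyd_max_step(scalars, codebook):
    """One Lloyd-Max update (assumed codebook-design step):
    move each entry to the mean of the scalars currently nearest to it."""
    idx = np.abs(scalars[:, None] - codebook).argmin(axis=1)
    updated = codebook.copy()
    for k in range(len(codebook)):
        members = scalars[idx == k]
        if members.size:  # leave unused entries where they are
            updated[k] = members.mean()
    return updated


def lo_bcq_sketch(blocks, init_codebooks, n_iters=10):
    """blocks: (n_blocks, block_len) floats.
    init_codebooks: (n_codebooks, 2**B) floats.
    Returns (codebooks, per-block cluster assignment, MSE per iteration)."""
    codebooks = np.array(init_codebooks, dtype=float)
    history = []
    for _ in range(n_iters):
        # Step (i): codebooks fixed, map each block to the codebook
        # that gives it the lowest squared quantization error.
        errors = np.stack(
            [((blocks - quantize(blocks, cb)) ** 2).sum(axis=1) for cb in codebooks],
            axis=1,
        )
        assign = errors.argmin(axis=1)

        # Step (ii): clusters fixed, refit each cluster's codebook
        # on the scalars of the blocks assigned to it.
        for c in range(len(codebooks)):
            members = blocks[assign == c]
            if members.size:
                codebooks[c] = lloyd_max_step(members.ravel(), codebooks[c])

        # Track the mean squared error under the current assignment.
        recon = np.empty_like(blocks)
        for c, cb in enumerate(codebooks):
            mask = assign == c
            if mask.any():
                recon[mask] = quantize(blocks[mask], cb)
        history.append(float(((blocks - recon) ** 2).mean()))
    return codebooks, assign, history
```

Neither step can increase the error: reassigning a block only happens when another codebook fits it better, and a centroid update followed by nearest-entry encoding never does worse than the previous codebook. The recorded MSE is therefore non-increasing across iterations, which is the sense in which the procedure is greedy and locally optimal.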

Paper Structure

This paper contains 27 sections, 17 equations, 10 figures, 17 tables.

Figures (10)

  • Figure 1: Wikitext perplexity loss relative to the unquantized baseline vs. compression factor of LO-BCQ compared to previous LLM quantization proposals. Here, compression factor is the cumulative number of bits in the weight and activation tensors that need to be processed in each layer relative to an unquantized BF16 baseline.
  • Figure 2: Block clustered quantization: Each operand block is first mapped to a cluster based on a mapping function and then each scalar of that block is encoded as a $B$-bit index to the closest entry in the $2^B$-entry codebook associated with that cluster.
  • Figure 3: Overview of LO-BCQ algorithm: The algorithm starts with a set of initial per-cluster codebooks, and then iteratively performs two steps (i) fix per-cluster codebooks and update block clusters and (ii) fix block clusters and update per-cluster codebooks.
  • Figure 4: NMSE of LO-BCQ with naive initialization compared to the proposed initialization. Here, LO-BCQ is configured with a block array size of $64$ and $16$ codebooks.
  • Figure 5: Block format for LO-BCQ. Each operand block is associated with a $\log_2(N_c)$-bit selector that selects the best codebook, and each scalar is a $4$-bit index that represents the closest value in the selected codebook. Each block array is associated with an $8$-bit scale factor.
  • ...and 5 more figures
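
The block format in the Figure 5 caption implies a simple per-scalar bit budget: the $4$-bit index, plus the $\log_2(N_c)$-bit selector amortized over one block, plus the $8$-bit scale factor amortized over one block array. The sketch below computes that sum. The block length is not stated in the captions above, so `block_len` is a hypothetical parameter, and the function name `effective_bits` is introduced here for illustration only.

```python
import math


def effective_bits(block_len, array_len, num_codebooks, index_bits=4, scale_bits=8):
    """Average storage per scalar, in bits, for the Figure 5 block format.

    block_len     -- scalars per block (assumed; not given in the captions)
    array_len     -- scalars per block array (shares one scale factor)
    num_codebooks -- N_c, the number of codebooks a selector chooses from
    """
    selector_bits = math.log2(num_codebooks)  # one selector per block
    return index_bits + selector_bits / block_len + scale_bits / array_len


# Block array of 64 and 16 codebooks (the Figure 4 configuration),
# with an assumed block length of 8 scalars:
print(effective_bits(block_len=8, array_len=64, num_codebooks=16))  # 4.625
```

Under that assumed block length, the result is 4 + 4/8 + 8/64 = 4.625 bits, which matches the upper end of the effective width quoted in the TL;DR. Fewer codebooks or a longer block array lower the overhead.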