BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference
Reena Elangovan, Charbel Sakr, Anand Raghunathan, Brucek Khailany
TL;DR
The paper addresses accurate post-training quantization (PTQ) of both weights and activations in large language models at 4-bit precision. It introduces block clustered quantization (BCQ) and a locally optimal variant (LO-BCQ) that alternates between clustering operand blocks and designing a codebook for each cluster so as to minimize mean squared error. The resulting small set of codebooks is calibrated once and can then be frozen and reused across models. With W4A4 quantization (effective bit-width of roughly $4.5$–$4.625$ bits), LO-BCQ reports state-of-the-art accuracy with <1% loss on downstream tasks, is also evaluated by perplexity on Wikitext-103, and requires no weight updates. Because it relies only on small codebooks and per-block metadata, the approach is amenable to efficient hardware acceleration. Results are shown for GPT3, Llama2, and Nemotron models across multiple evaluation tasks, making LO-BCQ a practical option for PTQ in LLM deployment.
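The effective bit-widths quoted above can be reproduced with simple per-scalar accounting. The sketch below is illustrative only: the helper `effective_bits` and the specific configurations (block length 8, 8 or 16 codebooks, one 8-bit scale factor per 64 scalars) are assumptions chosen because they yield 4.5 and 4.625 bits; this summary does not state the exact settings.

```python
import math


def effective_bits(scalar_bits, num_codebooks, block_len, scale_bits, scale_group_len):
    """Average bits stored per scalar, including per-block metadata.

    All parameters are hypothetical knobs for illustration:
    - each block of `block_len` scalars stores one codebook selector
    - each group of `scale_group_len` scalars shares one scale factor
    """
    selector_bits = math.ceil(math.log2(num_codebooks))
    selector_overhead = selector_bits / block_len   # selector cost spread over the block
    scale_overhead = scale_bits / scale_group_len   # scale cost spread over the group
    return scalar_bits + selector_overhead + scale_overhead


# Assumed configurations, not taken from the text above.
low = effective_bits(scalar_bits=4, num_codebooks=8, block_len=8,
                     scale_bits=8, scale_group_len=64)
high = effective_bits(scalar_bits=4, num_codebooks=16, block_len=8,
                      scale_bits=8, scale_group_len=64)
print(low, high)
```

Under these assumptions the metadata costs 0.5 and 0.625 bits per scalar respectively, which is why the format is still described as W4A4 while the effective width is slightly above 4 bits.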
Abstract
Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-8-bit precision while maintaining activations at 8 bits or higher. Accurate sub-8-bit quantization for both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ), wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps of block clustering and codebook design to greedily minimize the quantization mean squared error. When weight and activation scalars are encoded in the W4A4 format (with 0.5 bits of overhead for storing scaling factors and codebook selectors), we advance the current state of the art by demonstrating <1% loss in inference accuracy across several LLMs and downstream tasks.
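The alternation between block clustering and codebook design can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random initialization, the use of Lloyd-Max centroid updates for codebook design, and the default sizes are all hypothetical, and the per-block scaling factors mentioned in the abstract are omitted for brevity.

```python
import numpy as np


def quantize_to_codebook(values, codebook):
    """Return, for each scalar, the index of its nearest codebook entry."""
    return np.abs(values[..., None] - codebook).argmin(axis=-1)


def block_sq_error(blocks, codebook):
    """Squared quantization error of each block under a single codebook."""
    idx = quantize_to_codebook(blocks, codebook)
    return ((blocks - codebook[idx]) ** 2).sum(axis=1)


def lloyd_max(values, codebook, iters=10):
    """Codebook design: move each entry to the mean of the scalars it serves."""
    codebook = codebook.copy()
    for _ in range(iters):
        idx = quantize_to_codebook(values, codebook)
        for k in range(len(codebook)):
            members = values[idx == k]
            if members.size:
                codebook[k] = members.mean()
    return codebook


def lo_bcq_sketch(tensor, block_len=8, num_codebooks=4, entries=16, rounds=5, seed=0):
    """Alternate block clustering and per-cluster codebook design (assumed defaults)."""
    rng = np.random.default_rng(seed)
    blocks = np.asarray(tensor, dtype=np.float64).reshape(-1, block_len)
    # Assumed initialization: codebook entries sampled from the data itself.
    codebooks = np.sort(rng.choice(blocks.ravel(), size=(num_codebooks, entries)), axis=1)
    mse_history = []
    for _ in range(rounds):
        # Clustering step: each block picks the codebook that quantizes it best.
        errors = np.stack([block_sq_error(blocks, cb) for cb in codebooks], axis=1)
        assign = errors.argmin(axis=1)
        # Codebook design step: refine each codebook on the blocks assigned to it.
        for c in range(num_codebooks):
            members = blocks[assign == c].ravel()
            if members.size:
                codebooks[c] = lloyd_max(members, codebooks[c])
        total = sum(block_sq_error(blocks[assign == c], codebooks[c]).sum()
                    for c in range(num_codebooks))
        mse_history.append(total / blocks.size)
    return codebooks, assign, mse_history
```

Both steps can only lower the error or leave it unchanged, so the recorded MSE is non-increasing across rounds; this is the sense in which such a procedure is greedy and locally optimal. With 16 entries per codebook, each scalar is stored as a 4-bit index, and the per-block codebook selector accounts for part of the metadata overhead.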
