AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs

Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

TL;DR

AnyBCQ tackles the hardware-efficient deployment of multi-precision LLMs by extending Binary-Coded Quantization (BCQ) to operate directly on binary bit-planes and support progressive precision expansion. It introduces a co-design that includes a BCQ-based representation with per-precision scaling factors and a dedicated CUDA kernel that avoids centroid lookups and bit-transpose overhead, enabling dynamic per-request precision with low overhead. Empirically, AnyBCQ achieves strong $2$-bit accuracy, competitive $3$–$4$-bit performance, and substantial throughput gains (up to $3.0\times$ over FP16) while reducing memory footprint relative to multi-model baselines. The approach provides a practical, single-model solution for diverse service-level objectives in LLM inference across different model sizes and workloads.

Abstract

The deployment of large language models (LLMs) is increasingly constrained by memory and latency bottlenecks, motivating the need for quantization techniques that flexibly balance accuracy and efficiency. Recent work has introduced multi-precision models, which enable inference at multiple precisions within a single model depending on runtime constraints. To support such flexibility, quantized weights are often stored as bit-planes, where hardware efficiency improves when computation operates directly at the bit-plane level and activates only the precision required by each request. In this work, we present AnyBCQ, a hardware-friendly multi-precision extension of Binary-Coded Quantization (BCQ) that supports direct bit-plane operations. By representing weights as binary bit-planes with corresponding scale factors, AnyBCQ enables bit-plane-level computation and maps naturally to accelerator-friendly, bit-parallel arithmetic. Our progressive precision expansion mechanism incrementally refines scaling factors while reusing previously assigned binary codes, yielding monotonic improvements in accuracy as additional bits are enabled. We further co-design a specialized kernel that exploits the BCQ structure to support dynamic per-request precision selection with negligible overhead. Experiments on recent LLMs demonstrate that AnyBCQ significantly narrows the accuracy drop in the low-bit regime (e.g., 2-bit), remains competitive at higher precision, and achieves throughput gains of up to 3.0x over half precision and 1.2x over state-of-the-art multi-precision methods. By aligning algorithmic flexibility with hardware efficiency, AnyBCQ provides a practical foundation for multi-precision LLM deployment across diverse service-level objectives.
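
The bit-plane view described in the abstract can be written compactly. The notation below is assumed for illustration and is not taken from the paper: $B_i$ is the $i$-th binary bit-plane of an $m \times n$ weight matrix, $\alpha_i^{(p)} \in \mathbb{R}^m$ is a hypothetical per-row scaling vector used at precision $p$ (the actual scaling granularity may differ), and $x$ is an input activation vector.

```latex
\[
\hat{W}^{(p)} = \sum_{i=1}^{p} \operatorname{diag}\!\big(\alpha_i^{(p)}\big)\, B_i,
\qquad B_i \in \{-1,+1\}^{m \times n},
\]
\[
\hat{W}^{(p)} x = \sum_{i=1}^{p} \operatorname{diag}\!\big(\alpha_i^{(p)}\big)\, (B_i x).
\]
```

Under these assumptions, a $p$-bit request only touches the first $p$ bit-planes, and each product $B_i x$ reduces to signed additions of the entries of $x$, which is one way to read the claim that computation can stay at the bit-plane level.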

Paper Structure

This paper contains 16 sections, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of AnyBCQ: (a) weights are first quantized to a base precision and progressively expanded to higher precisions by reusing the existing binary codes while adding residual bit-planes; (b) $p$-bit inference reconstructs weights by combining the corresponding scaling factors with the first $p$ binary bit-planes. In the binary representation, elements shown as 0 are mapped to $-1$.
  • Figure 2: Illustration of the binary-coding quantization scheme. Weights are quantized hierarchically, with each bit level determining its corresponding binary values. At each level, the scaling factor and binary assignment are computed and accumulated with the value obtained from the previous bit level to approximate the original weight. The resulting representation comprises bit-planes for each precision level, each paired with its associated scaling factors.
  • Figure 3: Matrix multiplication with (a) clustering-based quantization, which requires bit-plane transposition and centroid lookups, and (b) the proposed AnyBCQ kernel, which directly operates on binary bit-planes with scaling factors for hardware-efficient, dynamic-precision inference.
  • Figure 4: Accuracy–throughput trade-offs for 2-, 3-, and 4-bit configurations across models. The rightmost point denotes the 2-bit setting. For a given accuracy, AnyBCQ attains higher throughput (tokens/sec) than Any-Precision LLM (AP), with the largest gain at 2 bits.
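
The hierarchical scheme in the captions of Figures 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a greedy residual fit in which each bit-plane is the sign of the current residual and each scale is the mean absolute residual, it works on a flat list of weights, and it keeps one scale per bit level rather than refining scales per precision as AnyBCQ does. The names `greedy_bcq`, `reconstruct`, and `squared_error` are hypothetical.

```python
def greedy_bcq(weights, num_bits):
    """Greedy binary-coded quantization of a flat list of floats.

    Returns (scales, planes): one scale per bit level and one
    {-1, +1} bit-plane per bit level. Assumption: sign/mean-abs fitting.
    """
    residual = list(weights)
    scales, planes = [], []
    for _ in range(num_bits):
        # Binary code: sign of the residual (zero is mapped to +1).
        plane = [1 if r >= 0 else -1 for r in residual]
        # Scale that minimizes squared error for this sign pattern.
        scale = sum(abs(r) for r in residual) / len(residual)
        scales.append(scale)
        planes.append(plane)
        # Carry the remaining error to the next bit level.
        residual = [r - scale * b for r, b in zip(residual, plane)]
    return scales, planes


def reconstruct(scales, planes, p):
    """Rebuild weights from the first p bit-planes and their scales."""
    n = len(planes[0])
    return [sum(scales[i] * planes[i][j] for i in range(p)) for j in range(n)]


def squared_error(a, b):
    """Sum of squared differences between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Illustrative weights (made up for this example).
weights = [0.9, -0.4, 0.1, -1.2, 0.3, 0.7, -0.8, 0.05]
scales, planes = greedy_bcq(weights, num_bits=4)

# Error when only the first p bit-planes are activated, for p = 1..4.
errors = [squared_error(weights, reconstruct(scales, planes, p))
          for p in range(1, 5)]
```

In this sketch the same stored bit-planes serve every precision: a 2-bit request calls `reconstruct(scales, planes, 2)` and a 4-bit request calls `reconstruct(scales, planes, 4)`. With the greedy fit, the reconstruction error cannot increase as more planes are enabled, which mirrors the monotonic behaviour the abstract describes.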