
1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization

Sohir Maskey, Constantin Eichenberg, Johannes Messner, Douglas Orr

TL;DR

K-means-based weight quantization outperforms integer formats and can be implemented efficiently on standard hardware. Under a fixed inference memory budget, the best performance on generative downstream tasks is achieved with $1$-bit quantized weights.

Abstract

Quantization-aware training (QAT) is an effective method to drastically reduce the memory footprint of LLMs while keeping performance degradation at an acceptable level. However, the optimal choice of quantization format and bit-width presents a challenge in practice. The full design space of quantization is not fully explored in the context of QAT, and the precise trade-off between quantization and downstream performance is poorly understood, as comparisons often rely solely on perplexity-based evaluations. In this work, we address these shortcomings with an empirical study of QAT in the low-bit regime. We show that k-means based weight quantization outperforms integer formats and can be implemented efficiently on standard hardware. Furthermore, we find that, under a fixed inference memory budget, the best performance on generative downstream tasks is achieved with $1$-bit quantized weights.
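To make the contrast between k-means and uniform formats concrete, the following is a minimal sketch of 1-D k-means (Lloyd's algorithm) applied to a single block of weights, compared with an evenly spaced grid of the same size. It is an illustration only and not the authors' implementation: the function names, the quantile initialisation, the min-max uniform grid, and the Gaussian test data are all assumptions introduced here.

```python
import random


def nearest(codebook, w):
    """Index of the codebook entry closest to w."""
    return min(range(len(codebook)), key=lambda j: (w - codebook[j]) ** 2)


def kmeans_quantize(weights, bits, iters=25):
    """Fit 2**bits centroids to `weights` with Lloyd's algorithm (hypothetical helper)."""
    k = 2 ** bits
    srt = sorted(weights)
    n = len(srt)
    # Assumption: initialise centroids at evenly spaced quantiles of the data.
    codebook = [srt[min(n - 1, int((i + 0.5) * n / k))] for i in range(k)]
    for _ in range(iters):
        # Assignment step: map every weight to its nearest centroid.
        indices = [nearest(codebook, w) for w in weights]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [w for w, idx in zip(weights, indices) if idx == j]
            if members:
                codebook[j] = sum(members) / len(members)
    # Final assignment so the indices match the final codebook.
    indices = [nearest(codebook, w) for w in weights]
    return codebook, indices


def uniform_quantize(weights, bits):
    """Evenly spaced grid between min and max (a simple stand-in for an integer format)."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    codebook = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    return codebook, [nearest(codebook, w) for w in weights]


def mse(weights, codebook, indices):
    """Mean squared reconstruction error after dequantization."""
    return sum((w - codebook[i]) ** 2 for w, i in zip(weights, indices)) / len(weights)


random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic weights, not model data
km_codebook, km_idx = kmeans_quantize(block, bits=2)
un_codebook, un_idx = uniform_quantize(block, bits=2)
km_err = mse(block, km_codebook, km_idx)
un_err = mse(block, un_codebook, un_idx)
```

In this sketch only the per-weight indices (2 bits each) and the small codebook would need to be stored. On bell-shaped data the k-means codebook concentrates its levels where the weights are dense, which is why `km_err` comes out below `un_err` here.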

Paper Structure (70 sections, 33 equations, 9 figures, 7 tables)


Figures (9)

  • Figure 1: IsoLoss contours under precision–budget tradeoffs. (Left) Precision–parameter tradeoff under a fixed 50B token budget. (Right) Precision–token tradeoff under a fixed 3.9B parameter budget. Background colors show the predicted loss gap between uniform integer quantization and k-means quantization (red indicates lower loss for k-means). Across all evaluated regimes, k-means strictly dominates uniform quantization, highlighting the consistent advantage of nonlinear formats under fixed memory budgets.
  • Figure 2: Precision-to-capacity mapping and the induced memory-normalized efficiency. (Left) The saturating function $f(P_w)$ models diminishing returns as weight precision increases, where higher values are better. (Right) The ratio $g(P_w)=f(P_w)/P_w$ determines the optimal precision under a fixed inference-memory budget $M = N P_w$, where higher values are better.
  • Figure 3: A theoretical model for speedup of $1$-bit and $4$-bit formats versus bf16, when decoded to bf16 in software on an L40S GPU.
  • Figure 4: Normalization ablation. Training loss curves with logarithmic $y$-axis for different block-wise normalization strategies. (Left) $2$-bit quantization, where $\mathrm{absmean}$ scaling leads to substantially more stable optimization. (Right) $4$-bit quantization, where $\mathrm{absmax}$ scaling becomes favorable. K-means quantization remains largely insensitive to this choice.
  • Figure 5: Training progression. Evolution of pretraining evaluations during the training of the $4$B bf16, $12$B $4$-bit, and $31$B $1$-bit models.
  • ...and 4 more figures
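
The fixed-budget reasoning in the Figure 2 caption can be sketched numerically. With $M = N P_w$, lowering the precision $P_w$ buys proportionally more parameters $N$. If effective capacity is read as $N f(P_w)$ (an interpretation assumed here, not stated in the caption), it equals $M \cdot g(P_w)$, so the best precision is the one that maximizes $g$. The saturating form of `f` and the constant `gamma` below are placeholders chosen for illustration; they are not the fitted function from the paper, and the numbers they produce are not results.

```python
import math


def n_params(budget_bits, p_w):
    """Parameter count N that fits in a weight-memory budget M = N * P_w (in bits)."""
    return budget_bits // p_w


def f(p_w, gamma=2.0):
    """Placeholder saturating precision-to-capacity map; gamma is hypothetical."""
    return 1.0 - math.exp(-p_w / gamma)


def g(p_w, gamma=2.0):
    """Memory-normalized efficiency g(P_w) = f(P_w) / P_w."""
    return f(p_w, gamma) / p_w


budget_bits = 64 * 10**9  # example budget: the weight memory of a 4B-parameter bf16 model
precisions = [1, 2, 4, 8, 16]

# Effective capacity under the fixed budget, per candidate precision.
capacity = {p: n_params(budget_bits, p) * f(p) for p in precisions}

# Maximizing capacity is the same as maximizing g, since N * f(P_w) = M * g(P_w).
best_by_capacity = max(precisions, key=lambda p: capacity[p])
best_by_g = max(precisions, key=g)
```

The budget counts weight bits only. Any per-block scales, codebooks, or unquantized layers would add overhead that this sketch ignores.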