Scaling Law for Quantization-Aware Training

Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo

TL;DR

The paper studies how quantization-aware training (QAT) scales at ultra-low bit-widths (W4A4) for large language models. It proposes a unified scaling law that models quantization error as a function of model size $N$, training tokens $D$, and quantization granularity (group size) $G$: $\delta_p(N,D,G) = \frac{k \cdot D^{\gamma_D} \cdot (\log_2(G))^{\gamma_G}}{N^{\gamma_N}}$. Adding this term to the Chinchilla loss gives $L(N,D,G) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E + \delta_p(N,D,G)$, which is validated on 268 QAT experiments. The paper then decomposes $\delta_{W4A4}$ into weight and activation errors and finds that the activation error at the FC2 input, largely driven by outliers, is the main bottleneck; keeping the FC2 input at 8 bits (mixed precision) brings the two error sources to similar levels and improves predictions. With more training data, weight quantization error can become the larger term, so both errors need attention. In practice, the law can guide choices of precision, granularity, and data budget for quantized LLMs, and it singles out the FC2 input as a target for future QAT improvements.
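To make the two formulas concrete, the following is a minimal sketch that evaluates $\delta_p(N,D,G)$ and $L(N,D,G)$ as written above. The function names and every coefficient value are assumptions chosen for illustration; they are not the paper's fitted constants, so only the direction of the trends should be read from the output.

```python
import math

# Placeholder coefficients for illustration only.
# These are NOT the fitted values reported in the paper.
QUANT_COEFFS = {"k": 0.1, "gamma_N": 0.3, "gamma_D": 0.1, "gamma_G": 0.5}
CHINCHILLA_COEFFS = {"A": 400.0, "alpha": 0.34, "B": 410.0, "beta": 0.28, "E": 1.7}


def quant_error(N, D, G, k, gamma_N, gamma_D, gamma_G):
    """delta_p(N, D, G) = k * D^gamma_D * (log2 G)^gamma_G / N^gamma_N."""
    if G <= 1:
        raise ValueError("group size G must exceed 1 so that log2(G) > 0")
    return k * D**gamma_D * math.log2(G) ** gamma_G / N**gamma_N


def chinchilla_loss(N, D, A, alpha, B, beta, E):
    """Full-precision Chinchilla term: A / N^alpha + B / D^beta + E."""
    return A / N**alpha + B / D**beta + E


def qat_loss(N, D, G, chinchilla=CHINCHILLA_COEFFS, quant=QUANT_COEFFS):
    """L(N, D, G): Chinchilla loss plus the quantization error term."""
    return chinchilla_loss(N, D, **chinchilla) + quant_error(N, D, G, **quant)


if __name__ == "__main__":
    # Hypothetical setting: 594M parameters, 100B tokens, group size 128.
    N, D, G = 594e6, 100e9, 128
    print("delta_p:", quant_error(N, D, G, **QUANT_COEFFS))
    print("L      :", qat_loss(N, D, G))
```

With positive exponents, the sketch reproduces the qualitative behavior the law encodes: the error term shrinks as $N$ grows and grows with $D$ and with $G$.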

Abstract

Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.
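The role of group size and outliers can be illustrated with a toy example. The sketch below uses a generic symmetric round-to-nearest INT4 fake quantizer with one scale per group; this scheme, the helper names, and the toy activation vector are assumptions for illustration and are not necessarily the quantizer used in the paper.

```python
def fake_quant_int4(values, group_size):
    """Symmetric round-to-nearest INT4 fake quantization with one scale per group."""
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # Map the largest magnitude in the group to the INT4 level 7.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        for v in group:
            q = max(-8, min(7, round(v / scale)))  # clamp to the INT4 range [-8, 7]
            out.append(q * scale)                  # dequantize back to float
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy "activation" vector: small values in [-0.3, 0.3] plus a single outlier.
acts = [0.1 * ((i % 7) - 3) for i in range(64)]
acts[5] = 20.0

fine = mse(acts, fake_quant_int4(acts, group_size=8))     # fine granularity
coarse = mse(acts, fake_quant_int4(acts, group_size=64))  # coarse granularity

print("MSE with G=8 :", fine)
print("MSE with G=64:", coarse)
```

With a single group of 64, the outlier sets the scale for every element and the small values all round to zero. With groups of 8, only the outlier's own group is affected, so the error is lower. This matches the direction of the trend described in the abstract: coarser granularity gives higher quantization error, and outliers make it worse.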

Paper Structure

This paper contains 21 sections, 19 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Quantization error contour based on the proposed unified QAT scaling law. The quantization error decreases as the model size increases, but increases with both the number of training tokens and with coarser quantization granularity.
  • Figure 2: Integer (INT4) vs. floating-point (FP4) in W4A4, 297M model, 50B tokens.
  • Figure 3: $\delta_{W4A4}$ at different learning rates, W4A4 ($G=128$) 145M model, 20B tokens.
  • Figure 4: Trend of $\delta_{W4A4}$ with varying $N$, $D$, and $G$. (a) $\delta_{W4A4}$ decreases as model size increases. (b) $\delta_{W4A4}$ increases with more training tokens. (c) $\delta_{W4A4}$ decreases with smaller group sizes. These trends are consistent across different $N$, $D$, and $G$. For simplicity, we plot only the models trained with 100B tokens in (a), the 594M model in (b), and the 594M model trained with 100B tokens in (c).
  • Figure 5: Fitting performance of $\delta_{W4A4}$ scaling laws.
  • ...and 11 more figures