Task-Stratified Knowledge Scaling Laws for Post-Training Quantized Large Language Models

Chenxi Zhou, Pengfei Cao, Jiang Li, Bohan Yu, Jinyu Ye, Jun Zhao, Kang Liu

TL;DR

The paper studies how post-training quantization (PTQ) affects different knowledge capabilities in LLMs. It introduces Task-Stratified Knowledge Scaling Laws that jointly account for model size, bit-width, calibration set size, and quantization group size. It defines a three-tier knowledge taxonomy of memorization (KM), application (KA), and reasoning (KR), and proposes a four-factor power-law framework with task-specific exponents, normalized accuracy, and a linearizable fitting procedure. Empirical validation across 293 PTQ configurations on Qwen3 and Llama-3 shows strong fits (adjusted $R^2$ close to 0.95) and distinct sensitivities: KR is precision-critical, KA scales with model size, and KM depends on calibration data, with phase-transition behavior at 2-bit. The work offers a practical, knowledge-aware quantization design space for mitigating performance collapse in low-bit regimes, along with evidence of cross-architecture generalization that is relevant to deployment decisions.
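
To illustrate what "linearizable" can mean for a law of this shape, the sketch below uses the 3-bit surface quoted in the Figure 5 caption as a template. The symbols $A$, $\alpha$, $\gamma$, and $\delta$ are placeholder names introduced here for illustration, and the full four-factor law (including its bit-width term) is not reproduced in this summary, so this should be read as an assumption about the general form rather than the authors' exact procedure.

$$
\mathrm{Acc}_{\text{norm}} = \exp\left[-A \, N^{-\alpha} (\log_2 C_b)^{-\gamma} G^{\delta}\right]
\;\Longrightarrow\;
\ln\left(-\ln \mathrm{Acc}_{\text{norm}}\right) = \ln A - \alpha \ln N - \gamma \ln(\log_2 C_b) + \delta \ln G
$$

The right-hand side is linear in $\ln A$, $\alpha$, $\gamma$, and $\delta$ whenever $0 < \mathrm{Acc}_{\text{norm}} < 1$ and $C_b > 1$, so under this assumed form the coefficients could be estimated with ordinary least squares on the transformed data.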

Abstract

Post-Training Quantization (PTQ) is a critical strategy for the efficient deployment of Large Language Models (LLMs). However, existing scaling laws primarily focus on general performance, overlooking crucial fine-grained factors and how quantization differentially impacts diverse knowledge capabilities. To address this, we establish Task-Stratified Knowledge Scaling Laws. By stratifying capabilities into memorization, application, and reasoning, we develop a framework that unifies model size, bit-width, and two fine-grained factors: group size and calibration set size. Validated on 293 diverse PTQ configurations, our framework demonstrates strong fit and cross-architecture consistency. It reveals distinct sensitivities across knowledge capabilities: reasoning is precision-critical, application is scale-responsive, and memorization is calibration-sensitive. We highlight that in low-bit scenarios, optimizing these fine-grained factors is essential for preventing performance collapse. These findings provide an empirically backed foundation for designing knowledge-aware quantization strategies.

Paper Structure

This paper contains 36 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the task-stratified knowledge taxonomy defined in this study.
  • Figure 2: Scaling trends of Model Size ($N$) and Bit-width ($B$) for Qwen3 models ($C_b=128, G=128$). Accuracy is averaged across five representative 4-choice tasks: Hellaswag, ARC-e/c, MMLU, and OpenbookQA. The dashed grey line represents the random baseline (0.25).
  • Figure 3: Scaling trends of Calibration Set Size ($C_b$) and Group Size ($G$) under 3-bit quantization. Benchmarks are the same as in Figure 2. (Left) Impact of $C_b$ with fixed $G=128$. (Right) Impact of $G$ with fixed $C_b=128$.
  • Figure 4: Goodness-of-fit: Predicted vs. actual normalized accuracy for (Left) our proposed four-factor law ($N, B, C_b, G$) and (Right) the baseline ($N, B$). Points are colored by bit-width ($B$) and sized by model size ($N$). Stars ($\star$) denote the validation data (Qwen3-32B). Dashed line represents ideal prediction.
  • Figure 5: Performance surface of the General Scaling Law in the 3-bit region ($\mathrm{Acc}_{\text{norm}} = \exp\left[ - 966.56 \cdot N^{-0.322} (\log_2 C_b)^{-0.103} G^{0.117} \right]$, Adj.$R^2_{\mathcal{O}} = 0.97$). Points represent empirical data.
  • ...and 2 more figures
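
As a rough illustration of how the 3-bit surface in the Figure 5 caption could be evaluated, the following minimal sketch plugs the quoted coefficients into a small Python function. The function name `predict_acc_norm_3bit` and its argument names are hypothetical, and the code assumes that $N$ is the raw parameter count (e.g. `8e9`), which the caption does not state explicitly. The input values in the loop are illustrative only.

```python
import math

# Coefficients copied from the Figure 5 caption (3-bit region only).
A = 966.56        # overall scale of the penalty term
ALPHA_N = 0.322   # exponent on model size N
GAMMA_C = 0.103   # exponent on log2 of calibration set size C_b
DELTA_G = 0.117   # exponent on group size G


def predict_acc_norm_3bit(n_params, calib_size, group_size):
    """Evaluate exp[-A * N^-alpha * (log2 C_b)^-gamma * G^delta].

    Assumption: n_params is a raw parameter count, not billions.
    """
    if n_params <= 0 or group_size <= 0:
        raise ValueError("n_params and group_size must be positive")
    if calib_size <= 1:
        # log2(C_b) must be strictly positive for the negative power.
        raise ValueError("calib_size must be greater than 1")
    penalty = (
        A
        * n_params ** (-ALPHA_N)
        * math.log2(calib_size) ** (-GAMMA_C)
        * group_size ** DELTA_G
    )
    return math.exp(-penalty)


if __name__ == "__main__":
    # Illustrative sweep over group size at fixed N and C_b.
    for g in (32, 64, 128):
        acc = predict_acc_norm_3bit(8e9, 128, g)
        print(f"G={g:>3}: predicted normalized accuracy = {acc:.3f}")
```

Under these coefficients the predicted normalized accuracy rises with larger $N$ and larger $C_b$ and falls with larger $G$, which follows directly from the signs of the exponents in the caption.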