On the Importance of a Multi-Scale Calibration for Quantization

Seungwoo Son, Ingyu Seong, Junhan Kim, Hyemi Jang, Yongkweon Jeon

TL;DR

This paper identifies a crucial but overlooked factor in PTQ for LLMs: input length strongly shapes the input-side Hessian $H_{in}$, affecting weight sensitivity estimates used for quantization. It introduces MaCa, a length-aware Hessian estimation method that aggregates across multiple sequence lengths and normalizes per-sequence contributions, yielding a richer Hessian $H_{(w)}$ for quantization. MaCa consistently improves GPTQ and GPTAQ performance across Qwen3, Gemma3, and LLaMA3, especially at low bit-widths and on long-context tasks, without increasing calibration costs. This approach bridges a gap in Hessian-based PTQ by addressing the impact of variable input lengths, enhancing practical deployment of large-scale models.

Abstract

Post-training quantization (PTQ) is a cornerstone for efficiently deploying large language models (LLMs), where a small calibration set critically affects quantization performance. However, conventional practices rely on random sequences of fixed length, overlooking the variable-length nature of LLM inputs. Input length directly influences the activation distribution and, consequently, the weight importance captured by the Hessian, which in turn affects quantization outcomes. As a result, Hessian estimates derived from fixed-length calibration may fail to represent the true importance of weights across diverse input scenarios. We propose MaCa (Matryoshka Calibration), a simple yet effective method for length-aware Hessian construction. MaCa (i) incorporates multi-scale sequence length information into Hessian estimation and (ii) regularizes each sequence as an independent sample, yielding a more stable and informative Hessian for accurate quantization. Experiments on state-of-the-art LLMs (e.g., Qwen3, Gemma3, LLaMA3) demonstrate that MaCa consistently improves accuracy under low-bit quantization, offering a lightweight enhancement compatible with existing PTQ frameworks. To the best of our knowledge, this is the first work to systematically highlight the role of multi-scale calibration in LLM quantization.
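To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch of a length-aware Hessian estimate. It is not the authors' implementation: the function names, the use of nested prefixes as the "multi-scale" lengths, and the choice to divide each prefix's $X^\top X$ by its token count are all assumptions made for illustration, since the summary above does not give the exact aggregation rule.

```python
import numpy as np


def fixed_length_hessian(acts):
    """Baseline: pool every token of every sequence into one X^T X estimate."""
    d = acts[0].shape[1]
    h = np.zeros((d, d))
    n_tokens = 0
    for x in acts:  # x has shape (seq_len, d)
        h += x.T @ x
        n_tokens += x.shape[0]
    return h / n_tokens


def multi_scale_hessian(acts, lengths):
    """Hypothetical length-aware estimate (assumed form, not the paper's).

    For each sequence and each target length, take the prefix of that
    length, normalise its X^T X by the prefix length so that each
    (sequence, length) pair counts as one sample, then average.
    """
    d = acts[0].shape[1]
    h = np.zeros((d, d))
    n_samples = 0
    for x in acts:
        for length in lengths:
            if length > x.shape[0]:
                continue  # skip lengths longer than this sequence
            prefix = x[:length]
            h += (prefix.T @ prefix) / length
            n_samples += 1
    if n_samples == 0:
        raise ValueError("no sequence is long enough for the requested lengths")
    return h / n_samples


# Toy calibration activations: 4 sequences, 64 tokens each, 8 input channels.
rng = np.random.default_rng(0)
acts = [rng.standard_normal((64, 8)) for _ in range(4)]

H_fixed = fixed_length_hessian(acts)
H_multi = multi_scale_hessian(acts, lengths=(16, 32, 64))
```

Under these assumptions, the multi-scale estimate reduces to the fixed-length baseline when a single length equal to the full sequence length is used, and it reuses the same calibration sequences rather than requiring additional data.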

Paper Structure (14 sections, 3 equations, 2 figures, 3 tables, 1 algorithm)


Figures (2)

  • Figure 1: Visualization of the Hessian diagonals. Top: GPTQ with fixed-length sequences yields limited Hessian diagonals. Bottom: MaCa with varied lengths produces a richer Hessian that captures diverse channel sensitivities.
  • Figure 2: Ratio of reconstruction error (GPTQ / MaCa). Histogram of error ratios across all linear layers of Qwen3-4B under 4-bit quantization. Values $>1$ mean MaCa has lower reconstruction error, which leads to better quantization.
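The ratio plotted in Figure 2 can be illustrated with a short NumPy sketch. The caption does not state how reconstruction error is defined, so the squared Frobenius norm of the layer-output difference used here is an assumption, and the "quantized" weights are simply perturbed copies of the original weights rather than outputs of GPTQ or MaCa.

```python
import numpy as np


def reconstruction_error(x, w, w_q):
    """Squared Frobenius error between full-precision and quantized outputs."""
    return float(np.linalg.norm(x @ w.T - x @ w_q.T) ** 2)


def error_ratio(x, w, w_q_baseline, w_q_candidate):
    """Baseline error divided by candidate error; > 1 favours the candidate."""
    return reconstruction_error(x, w, w_q_baseline) / reconstruction_error(
        x, w, w_q_candidate
    )


# Toy layer: 32 tokens, 8 input channels, 4 output channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8))
w = rng.standard_normal((4, 8))

# Stand-ins for two quantized weight matrices (hypothetical perturbations).
noise = rng.standard_normal(w.shape)
w_q_baseline = w + 0.10 * noise
w_q_candidate = w + 0.05 * noise

# The candidate's perturbation is half as large, so the ratio is close to 4.
ratio = error_ratio(x, w, w_q_baseline, w_q_candidate)
```

Computing this ratio once per linear layer and collecting the values would give a histogram of the kind the caption describes. The ratio is undefined if the candidate error is exactly zero.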