Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity

Hengyuan Zhang, Xinrong Chen, Zunhai Su, Xiao Liang, Jing Xiong, Wendong Xu, He Xiao, Chaofan Tao, Wei Zhang, Ruobing Xie, Lei Jiang, Hayden Kwok-Hay So, Ngai Wong

Abstract

Layer-wise mixed-precision quantization (LMPQ) enables effective compression under extreme low-bit settings by allocating higher precision to sensitive layers. However, existing methods typically treat all intra-layer weight modules uniformly and rely on a single numerical property when estimating sensitivity, overlooking their distinct operational roles and structural characteristics. To address this, we propose NSDS, a novel calibration-free LMPQ framework driven by Numerical and Structural Dual-Sensitivity. Specifically, it first mechanistically decomposes each layer into distinct operational roles and quantifies their sensitivity from both numerical and structural perspectives. These dual-aspect scores are then aggregated into a unified layer-wise metric through a robust aggregation scheme based on MAD-Sigmoid and Soft-OR to guide bit allocation. Extensive experiments demonstrate that NSDS consistently achieves superior performance compared to various baselines across diverse models and downstream tasks, without relying on any calibration data.
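
The aggregation step described above can be illustrated with a small sketch. The abstract names MAD-Sigmoid and Soft-OR but does not define them here, so the code assumes common readings of both terms: MAD-Sigmoid as a robust z-score (centered on the median, scaled by the median absolute deviation) passed through a sigmoid, and Soft-OR as the probabilistic OR `1 - (1 - a)(1 - b)`. The function names, the 1.4826 scaling constant, the clamp value, and the simple top-k bit allocation are all illustrative assumptions, not the paper's definitions.

```python
import math
from statistics import median


def mad_sigmoid(scores, eps=1e-12, clamp=60.0):
    """Map raw scores to (0, 1] using a median/MAD z-score followed by a sigmoid.

    Assumed form; the paper's exact normalization may differ.
    """
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    scale = 1.4826 * mad + eps  # 1.4826 makes the MAD comparable to a std. dev. under normality
    out = []
    for s in scores:
        z = (s - med) / scale
        z = max(-clamp, min(clamp, z))  # clamp to avoid overflow in exp()
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out


def soft_or(a, b):
    """Probabilistic OR: the result is high if either input is high (assumed form)."""
    return 1.0 - (1.0 - a) * (1.0 - b)


def layer_sensitivity(numerical, structural):
    """Combine per-layer numerical and structural scores into one score per layer."""
    nv = mad_sigmoid(numerical)
    se = mad_sigmoid(structural)
    return [soft_or(a, b) for a, b in zip(nv, se)]


def allocate_bits(sensitivity, high=4, low=2, avg_budget=3.0):
    """Give `high` bits to the most sensitive layers while keeping the mean <= avg_budget."""
    n = len(sensitivity)
    k = int(n * (avg_budget - low) / (high - low))  # number of layers that can afford `high`
    k = max(0, min(n, k))
    ranked = sorted(range(n), key=lambda i: sensitivity[i], reverse=True)
    top = set(ranked[:k])
    return [high if i in top else low for i in range(n)]


# Hypothetical scores for four layers: layer 3 is a numerical outlier,
# layer 0 is numerically unremarkable but structurally prominent.
numerical_scores = [0.1, 0.2, 0.15, 5.0]
structural_scores = [0.9, 0.1, 0.2, 0.1]
sens = layer_sensitivity(numerical_scores, structural_scores)
bits = allocate_bits(sens, high=4, low=2, avg_budget=3.0)
```

Under these assumptions, a layer is ranked as sensitive when either view flags it, which matches the observation in Figure 1 that layers with low numerical but high structural sensitivity still degrade significantly when quantized.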

Paper Structure (57 sections, 31 equations, 10 figures, 3 tables, 1 algorithm)


Figures (10)

  • Figure 1: Layer-wise sensitivity across two LLMs. Each dot represents a layer. Darker dots indicate more severe perplexity degradation ($\Delta$PPL) when quantizing that layer. See §\ref{subsec:estimating_sensitivity} for the calculation of Numerical Vulnerability and Structural Expressiveness. Red dashed boxes highlight that layers with low numerical but high structural sensitivity still suffer significant degradation. See Appendix \ref{app:nsds_scores} for more results.
  • Figure 2: Overview of our NSDS framework for data-free layer sensitivity estimation. The left panel illustrates the mechanistic decomposition of a layer into distinct operational components, categorized as Detectors and Writers (§\ref{subsec:mechanistic_view}). The middle panel presents the dual-view sensitivity estimation: Numerical Vulnerability (NV) and Structural Expressiveness (SE) (§\ref{subsec:estimating_sensitivity}). The right panel shows the robust aggregation process, where MAD-Sigmoid normalizes heterogeneous scores and Soft-OR integrates them into a unified layer-wise sensitivity metric, which is then used for bit allocation under a target budget (§\ref{subsec:aggregation}). See Appendix \ref{app:nsds_algo} for the algorithmic description of the NSDS framework.
  • Figure 3: Average accuracy of NSDS and baselines on language reasoning benchmarks across different bit budgets for Llama-3.1-8B and Qwen2.5-7B.
  • Figure 4: Average accuracy of the ablation analysis on NSDS. "w/o" denotes the exclusion of a specific component from NSDS during layer sensitivity estimation.
  • Figure 5: Average accuracy comparison between NSDS framework and calibration-based baselines across generic reasoning benchmarks.
  • ...and 5 more figures