Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models

He Xiao, Qingyao Yang, Dirui Xie, Wendong Xu, Zunhai Su, Runming Yang, Wenyong Zhou, Haobo Liu, Zhengwu Liu, Ngai Wong

TL;DR

The paper tackles deploying sub-8B language models under extreme low-bit PTQ by linking layer saliency to representational geometry via a singular-value-based compactness metric. It introduces a geometry-driven allocation rule that assigns higher precision to the most information-dense layers, preserving uniform within-layer quantization and standard kernels. Experiments on Qwen3 and LLaMA3.x show near-2-bit accuracy with competitive perplexity and strong zero-shot reasoning while delivering hardware-friendly memory and throughput benefits. The work provides interpretable insights into quantization sensitivity and presents a practical, gradient-free, plug-and-play tool for edge deployment on resource-constrained devices.
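
To make the idea of a singular-value-based compactness metric concrete, here is a minimal sketch in Python with NumPy. The exact metric used in the paper is not given in this summary, so the definition below (the share of squared singular values held by the top k) is an assumption, and `topk_energy` and the synthetic matrices are hypothetical names and data chosen only for illustration.

```python
import numpy as np


def topk_energy(matrix: np.ndarray, k: int) -> float:
    """Fraction of spectral energy (sum of squared singular values)
    captured by the k largest singular values.

    Assumed definition for illustration; not taken from the paper.
    """
    # np.linalg.svd returns singular values sorted in descending order.
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    energy = singular_values ** 2
    return float(energy[:k].sum() / energy.sum())


rng = np.random.default_rng(0)

# A nearly rank-4 matrix: its energy concentrates in a few directions.
compact = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
compact += 0.01 * rng.standard_normal((64, 64))

# A full-rank Gaussian matrix: its energy is spread over many directions.
diffuse = rng.standard_normal((64, 64))

compact_score = topk_energy(compact, k=4)
diffuse_score = topk_energy(diffuse, k=4)
print(f"compact: {compact_score:.4f}, diffuse: {diffuse_score:.4f}")
```

Under this assumed definition, a score close to 1 means a few directions carry almost all of the energy, which is the kind of layer the summary describes as information-dense.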

Abstract

Large language models with billions of parameters are often over-provisioned: many layers contribute little unique information yet dominate the memory and energy footprint during inference. We present LieQ (Layer-wise information effectiveness Quantization), a hardware-native, metric-driven post-training quantization framework that addresses the critical challenge of maintaining accuracy in sub-8B models (those with fewer than 8B parameters) under extreme low-bit compression. LieQ keeps a uniform bit-width within each layer while mixing precision across layers, preserving standard multiplication kernels and avoiding irregular memory access, codebooks, or irregular formats at inference time. Our method uncovers a strong correlation between layer-wise functional saliency and representational compactness, revealing that layers with higher training-induced energy concentration are functionally irreplaceable. Leveraging this insight, we propose a purely geometry-driven sensitivity proxy that enables automatic bit-width allocation under a target average-bit budget without expensive gradient updates or inference-based perplexity probing. At sub-2-bit precision, LieQ consistently reduces the large accuracy gap typically observed for naive 2-bit baselines on the Qwen3 and LLaMA3.x families, while retaining standard-kernel efficiency. These properties make LieQ a practical path toward deploying small language models on resource-constrained edge devices. Code will be available at https://github.com/HeXiao-55/LieQ-official.git.
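
The abstract describes automatic bit-width allocation under a target average-bit budget. The sketch below shows one simple way such a rule could work. It is a greedy scheme chosen for illustration, not the paper's algorithm, and `allocate_bits`, the scores, and the layer sizes are all hypothetical.

```python
def allocate_bits(scores, sizes, target_avg_bits, low=2, high=4):
    """Greedy layer-wise allocation (illustrative assumption, not LieQ itself).

    Every layer starts at `low` bits. Layers are then visited from the
    highest sensitivity score to the lowest and promoted to `high` bits
    whenever the parameter-weighted average stays within the budget.
    """
    total_params = sum(sizes)
    budget = target_avg_bits * total_params  # total bits allowed
    used = low * total_params                # bits used by the all-`low` baseline
    bits = [low] * len(scores)

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in order:
        extra = (high - low) * sizes[i]
        if used + extra <= budget:
            bits[i] = high
            used += extra
        # A layer that does not fit is skipped; later (smaller) layers may still fit.
    return bits, used / total_params


# Hypothetical sensitivity scores and parameter counts for four layers.
scores = [0.2, 0.9, 0.4, 0.3]
sizes = [100, 100, 100, 100]
bits, avg_bits = allocate_bits(scores, sizes, target_avg_bits=2.5)
print(bits, avg_bits)
```

Each layer keeps a single bit-width, so the mixing happens only across layers, in line with the uniform within-layer quantization the abstract emphasizes. The budget here counts weight bits only and ignores quantization metadata such as scales and zero points.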

Paper Structure

This paper contains 8 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Layer-wise information taxonomy (each dot corresponds to one layer) across three correlated diagnostic metrics. Smaller models (e.g., Qwen-0.6B) exhibit lower robustness under extreme low-bit quantization, with certain layers being significantly more critical (clustered dots with deeper color) than others. Increasing the model size spreads out and balances the importance across layers.
  • Figure 2: The functional diagnostic measures the perplexity degradation when a layer is removed from models in the Qwen3 family. Representational compactness and TopK energy are used as proxies for significant perplexity loss.
  • Figure 3: Illustration of the mixed-precision schemes. (i) Element-wise quantization with FP16 weight protection. (ii) Group-wise 2-bit quantization with 1-bit and 3-bit weights to balance accuracy and memory footprint. (iii) Block-wise 4-bit quantization within attention blocks in different layers. (iv) LieQ: only the single most significant layer, the one with the most compact information, is quantized to 4-bit, while the rest are quantized to 2-bit.
  • Figure 4: Microbenchmark latency of the gate_proj layer for LLaMA-3.2-3B and LLaMA-3.1-8B.
  • Figure 5: Average accuracy difference on language reasoning tasks with various precision configurations on small language models.
  • ...and 1 more figure
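
As a worked illustration of scheme (iv) in Figure 3, the average weight bit-width can be written out as follows. This assumes $L$ layers of equal parameter count and ignores quantization metadata such as scales and zero points; $L = 28$ is a hypothetical layer count chosen only for the arithmetic.

```latex
\bar{b} \;=\; \frac{(L-1)\cdot 2 + 1\cdot 4}{L} \;=\; 2 + \frac{2}{L},
\qquad
L = 28 \;\Rightarrow\; \bar{b} = 2 + \tfrac{2}{28} \approx 2.07 .
```

Under these assumptions, promoting one layer to 4-bit adds only $2/L$ bits to the average, so the overhead shrinks as the layer count grows.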