CKA-Guided Modular Quantization: Beyond Bit-Width to Algorithmic Diversity

Jinhao Zhang, Yunquan Zhang, Daning Chen

TL;DR

The paper tackles the inefficiency of uniform post-training quantization by introducing a CKA-guided modular quantization framework that selects different PTQ algorithms for each transformer layer. It leverages Linear Centered Kernel Alignment to measure how well each per-layer quantized variant preserves the full-precision representation, then uses a greedy layer-wise strategy to assemble a heterogeneously quantized model without retraining. The approach demonstrates superior perplexity and downstream task performance across LLaMA and Qwen models compared with uniform PTQ and fixed mixed-precision baselines, while incurring only modest offline calibration costs and no online latency increase. This work highlights the importance of algorithmic diversity, rather than solely bit-width heterogeneity, for efficient and accurate LLM quantization in a training-free, plug-and-play manner.

Abstract

Current mainstream post-training quantization (PTQ) methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking the substantial differences in algorithmic suitability among layers. To address this limitation, we propose CKA-Guided Modular Quantization, a fine-tuning-free, plug-and-play framework for algorithmically heterogeneous quantization. Our method independently evaluates multiple PTQ algorithms on each layer and employs Linear Centered Kernel Alignment (CKA) as a metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated to construct a hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMs, including LLaMA and Qwen, in terms of perplexity (PPL) and downstream task performance.
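
The per-layer selection described in the abstract can be illustrated with a minimal sketch. It uses the standard linear CKA formula on column-centered activation matrices and keeps, for each layer, the candidate whose output scores highest against the full-precision output. Everything else here is an assumption made for illustration: the function names `linear_cka` and `select_methods`, the placeholder method names `method_a` and `method_b`, and the synthetic activations, where Gaussian noise stands in for quantization error. The paper's actual calibration data, candidate PTQ algorithms, and selection procedure may differ.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, n_features)."""
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def select_methods(fp_acts, quant_acts):
    """Pick, for each layer, the candidate with the highest CKA to full precision.

    fp_acts:    list of full-precision activation arrays, one per layer.
    quant_acts: list of dicts (one per layer) mapping a method name to the
                activations produced by that layer quantized with that method.
    """
    choices = []
    for fp, candidates in zip(fp_acts, quant_acts):
        scores = {name: linear_cka(fp, act) for name, act in candidates.items()}
        choices.append(max(scores, key=scores.get))
    return choices


# Synthetic demo (hypothetical data): noise stands in for quantization error,
# and each placeholder method is "good" on different layers.
rng = np.random.default_rng(0)
fp_acts = [rng.standard_normal((64, 16)) for _ in range(3)]
noise = {"method_a": [0.05, 0.5, 0.5], "method_b": [0.5, 0.05, 0.05]}
quant_acts = [
    {name: fp + levels[i] * rng.standard_normal(fp.shape)
     for name, levels in noise.items()}
    for i, fp in enumerate(fp_acts)
]
print(select_methods(fp_acts, quant_acts))
```

Because each layer is scored independently, the selection cost grows linearly with the number of layers times the number of candidate methods, and all of it is paid offline during calibration.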

Paper Structure

This paper contains 25 sections, 3 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison between previous multi-precision quantization and our proposed multi-quantization-methods approach. Left: existing methods use different bit-widths (e.g., low vs. high precision) but a single quantization algorithm across layers. Right: our method applies diverse quantization algorithms (e.g., GPTQ, AWQ, SmoothQuant) to different layers, enabling algorithmic heterogeneity for improved performance.
  • Figure 2: Overview of our CKA-Guided Modular Quantization Framework. (a) We first analyze layer-wise sensitivity using CKA. (b) Then, we competitively select the optimal quantization method (e.g., GPTQ & SmoothQuant) for each layer. (c) Finally, we integrate these layers into a unified model. This framework achieves optimal heterogeneity at the algorithmic level without retraining.
  • Figure 3: Layer-wise CKA score distribution of Llama-3-8B under different quantization methods.
  • Figure 4: Layer-wise method selection results derived from our CKA-guided framework.
  • Figure 5: Layer-wise method selection results derived from our CKA-guided framework.