CKA-Guided Modular Quantization: Beyond Bit-Width to Algorithmic Diversity
Jinhao Zhang, Yunquan Zhang, Daning Chen
TL;DR
The paper tackles the inefficiency of uniform post-training quantization by introducing a CKA-guided modular quantization framework that selects different PTQ algorithms for each transformer layer. It leverages Linear Centered Kernel Alignment to measure how well each per-layer quantized variant preserves the full-precision representation, then uses a greedy layer-wise strategy to assemble a heterogeneously quantized model without retraining. The approach demonstrates superior perplexity and downstream task performance across LLaMA and Qwen models compared with uniform PTQ and fixed mixed-precision baselines, while incurring only modest offline calibration costs and no online latency increase. This work highlights the importance of algorithmic diversity, rather than solely bit-width heterogeneity, for efficient and accurate LLM quantization in a training-free, plug-and-play manner.
Abstract
Current mainstream post-training quantization (PTQ) methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking the substantial differences in algorithmic suitability among layers. To address this limitation, we propose CKA-Guided Modular Quantization, a fine-tuning-free, plug-and-play framework for algorithmically heterogeneous quantization. Our method independently evaluates multiple PTQ algorithms on each layer and employs Linear Centered Kernel Alignment (CKA) as a metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated to construct a hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMs, including LLaMA and Qwen, in terms of perplexity (PPL) and downstream task performance.
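The selection step described above can be illustrated with a minimal sketch. It assumes the standard feature-space form of linear CKA, ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) on column-centered activations, and it scores each layer independently on pre-collected calibration activations. The function names, the candidate labels, and the noise-based stand-ins for quantized outputs are hypothetical and are not taken from the paper; the paper's greedy procedure may differ in how activations are propagated between layers.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (samples, features)."""
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def select_per_layer(fp_acts, candidate_acts):
    """Pick, for every layer, the candidate whose output best matches full precision.

    fp_acts:        list of full-precision activations, one array per layer.
    candidate_acts: list of dicts (one per layer) mapping a candidate label
                    to the activations produced by that quantized variant.
    """
    choices = []
    for fp, candidates in zip(fp_acts, candidate_acts):
        scores = {name: linear_cka(fp, act) for name, act in candidates.items()}
        choices.append(max(scores, key=scores.get))
    return choices


# Toy demonstration: additive noise stands in for quantization error
# (an assumption for illustration only, not a real PTQ algorithm).
rng = np.random.default_rng(0)
fp_acts = [rng.standard_normal((64, 16)) for _ in range(3)]
candidate_acts = [
    {
        "candidate_a": fp + 0.01 * rng.standard_normal(fp.shape),
        "candidate_b": fp + 1.00 * rng.standard_normal(fp.shape),
    }
    for fp in fp_acts
]
print(select_per_layer(fp_acts, candidate_acts))
```

Because the score is computed offline on calibration activations, a selection of this kind adds no cost at inference time, which is consistent with the training-free, plug-and-play framing above.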
