Calibration and Transformation-Free Weight-Only LLMs Quantization via Dynamic Grouping
Xinzhe Zheng, Zhen-Qun Yang, Zishan Liu, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin
TL;DR
MSB introduces calibration-free, transformation-free weight quantization for LLMs by generalizing binary quantization to multi-bit settings through dynamic grouping that minimizes within-group variance. It formulates a variance-based objective cost(G) and provides four solvers (DG, GG, WGM, WGM-LO) with clear accuracy-runtime trade-offs and a lambda-based boundary to control group sizes. The approach is validated on CPU across Llama 3.2, Falcon 3, and Gemma with 4-bit block-wise and 6-bit per-tensor settings, achieving competitive QA and perplexity scores against calibration-based baselines while eliminating calibration data and transformations. The work highlights a practical path toward CPU-friendly, calibration-free PTQ for large models, though it relies on simulation and leaves room for optimized kernels and optional calibration/transformations in future work.
Abstract
Large Language Models (LLMs) deliver strong performance but are difficult to deploy under tight memory and compute constraints. Low-bit post-training quantization (PTQ) is a promising direction; however, it typically relies on calibration data, auxiliary transformations, and GPU tools. To address these limitations, we propose MSB (Multi Scale Binary), a calibration-free and transformation-free PTQ method that generalizes binary quantization to multi-bit settings. MSB optimizes a dynamic grouping criterion that minimizes within-group variance, yielding group-wise multiscale levels that can be applied consistently across granularities, from per-tensor to block-wise configurations with 64-element groups per row, without calibration or intermediate transforms. We implement the optimization in a CPU-based solver for the quantization step and evaluate using standard bfloat16 execution without low-bit packing. On Llama 3.2 3B, MSB achieves 8.43 perplexity on WikiText-2 under 4-bit weight-only block-wise quantization, compared to 7.81 in full precision and 12.23 with GPTQ in its default setup. Overall, MSB provides a new optimization perspective for low-bit PTQ while simplifying the pipeline by removing calibration and transformations.
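
To make the idea of grouping by within-group variance concrete, the following is a minimal sketch and not the paper's solvers (DG, GG, WGM, and WGM-LO are not specified in this summary). It assumes that a b-bit code spends one bit on the sign and the remaining b-1 bits on a group index, so a block gets up to 2^(b-1) magnitude groups, each reconstructed as sign(w) times the group's mean magnitude. The function names `group_magnitudes` and `quantize_block` are hypothetical, and the grouping uses a generic exact 1-D dynamic program over sorted magnitudes.

```python
import numpy as np


def group_magnitudes(w, num_groups):
    """Partition sorted |w| into contiguous groups that minimize the
    total within-group squared deviation (exact 1-D dynamic program)."""
    mags = np.abs(np.asarray(w, dtype=np.float64))
    order = np.argsort(mags)          # indices that sort the magnitudes
    a = mags[order]
    n = a.size
    k = min(num_groups, n)            # cannot have more groups than elements

    # Prefix sums give the squared error of any slice a[i:j] in O(1).
    p1 = np.concatenate(([0.0], np.cumsum(a)))
    p2 = np.concatenate(([0.0], np.cumsum(a * a)))

    def sse(i, j):
        s = p1[j] - p1[i]
        return (p2[j] - p2[i]) - s * s / (j - i)

    # cost[g, j]: best cost of splitting a[:j] into g non-empty groups.
    cost = np.full((k + 1, n + 1), np.inf)
    cut = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = cost[g - 1, i] + sse(i, j)
                if c < cost[g, j]:
                    cost[g, j] = c
                    cut[g, j] = i

    # Backtrack to recover group boundaries [0, ..., n].
    bounds = [n]
    j = n
    for g in range(k, 0, -1):
        j = cut[g, j]
        bounds.append(j)
    return order, a, bounds[::-1]


def quantize_block(w, bits=4):
    """Assumed scheme: 1 sign bit + (bits - 1) group-index bits.
    With bits=1 this reduces to plain binary quantization."""
    w = np.asarray(w, dtype=np.float64)
    order, a, bounds = group_magnitudes(w, 2 ** (bits - 1))
    w_hat = np.empty_like(w)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        idx = order[lo:hi]
        # Each group shares one scale: the mean magnitude of its members.
        w_hat[idx] = np.sign(w[idx]) * a[lo:hi].mean()
    return w_hat
```

Calling `quantize_block(block, bits=4)` on a 64-element row block mirrors the block-wise setting mentioned in the abstract. The cubic-time loop is only practical for small blocks; a per-tensor setting would need a faster solver, which is presumably what the paper's accuracy-runtime trade-offs address.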
