Calibration and Transformation-Free Weight-Only LLMs Quantization via Dynamic Grouping

Xinzhe Zheng, Zhen-Qun Yang, Zishan Liu, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin

TL;DR

MSB introduces calibration-free, transformation-free weight quantization for LLMs by generalizing binary quantization to multi-bit settings through dynamic grouping that minimizes within-group variance. It formulates a variance-based objective cost(G) and provides four solvers (DG, GG, WGM, WGM-LO) with clear accuracy-runtime trade-offs and a lambda-based boundary to control group sizes. The approach is validated on CPU across Llama 3.2, Falcon 3, and Gemma with 4-bit block-wise and 6-bit per-tensor settings, achieving competitive QA and perplexity scores against calibration-based baselines while eliminating calibration data and transformations. The work highlights a practical path toward CPU-friendly, calibration-free PTQ for large models, though its evaluation relies on simulated quantization (standard bfloat16 execution without low-bit packing) and leaves optimized kernels and optional calibration or transformations to future work.

Abstract

Large Language Models (LLMs) deliver strong performance but are difficult to deploy under tight memory and compute constraints. Low-bit post-training quantization (PTQ) is a promising direction; however, it typically relies on calibration data, auxiliary transformations, and GPU tools. To address these limitations, we propose MSB (Multi Scale Binary), a calibration-free and transformation-free PTQ method that generalizes binary quantization to multi-bit settings. MSB optimizes a dynamic grouping criterion that minimizes within-group variance, yielding group-wise multiscale levels that can be applied consistently across granularities, from per-tensor to block-wise configurations with 64-element groups per row, without calibration or intermediate transforms. We implement the optimization in a CPU-based solver for the quantization step and evaluate using standard bfloat16 execution without low-bit packing. On Llama 3.2 3B, MSB achieves 8.43 perplexity on WikiText-2 under 4-bit weight-only block-wise quantization, compared to 7.81 in full precision and 12.23 with GPTQ in its default setup. Overall, MSB provides a new optimization perspective for low-bit PTQ while simplifying the pipeline by removing calibration and transformations.
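
To make the grouping idea in the abstract concrete, the following is a minimal sketch under stated assumptions, not the paper's DG, GG, WGM, or WGM-LO solvers. It assumes each weight is stored as a sign plus a group index, that each group's scale is the mean magnitude of its members, and that groups are contiguous ranges of the sorted magnitudes. It then finds the partition with the smallest within-group sum of squared deviations by exact dynamic programming, which is only practical for small inputs such as a single 64-element block. The function name `group_quantize` and its parameters are hypothetical.

```python
def group_quantize(weights, num_groups):
    """Illustrative sign-plus-group-scale quantizer (assumption, not the paper's solver).

    Splits sorted magnitudes into at most `num_groups` contiguous groups that
    minimize the total within-group sum of squared deviations, then replaces
    each weight by sign(w) * mean magnitude of its group.
    Returns (dequantized_weights, total_cost).
    """
    n = len(weights)
    mags = [abs(w) for w in weights]
    order = sorted(range(n), key=lambda t: mags[t])  # indices by ascending |w|
    k = min(num_groups, n)

    # Prefix sums give the cost of any contiguous group in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for t in order:
        prefix.append(prefix[-1] + mags[t])
        prefix_sq.append(prefix_sq[-1] + mags[t] * mags[t])

    def cost(i, j):
        # Sum of squared deviations from the mean for sorted positions i..j-1.
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # dp[g][j]: best cost of splitting the first j sorted magnitudes into g groups.
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = dp[g - 1][i] + cost(i, j)
                if c < dp[g][j]:
                    dp[g][j] = c
                    cut[g][j] = i

    # Backtrack the group boundaries and write the dequantized values.
    w_hat = [0.0] * n
    j = n
    for g in range(k, 0, -1):
        i = cut[g][j]
        scale = (prefix[j] - prefix[i]) / (j - i)  # mean magnitude of the group
        for t in order[i:j]:
            w_hat[t] = scale if weights[t] >= 0 else -scale  # zero treated as positive
        j = i
    return w_hat, dp[k][n]


if __name__ == "__main__":
    block = [1.0, -1.0, 3.0, -3.0]
    print(group_quantize(block, 1))  # one group: classic binary quantization
    print(group_quantize(block, 2))  # two groups: two magnitude levels
```

With `num_groups=1` the sketch reduces to binary quantization with a single scale equal to the mean magnitude. Increasing the number of groups adds magnitude levels, which is one way to read the multi-bit generalization described above. The cubic-time dynamic program is chosen here for clarity only.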

Paper Structure

This paper contains 38 sections, 36 equations, 10 figures, 26 tables, 4 algorithms.

Figures (10)

  • Figure 1: Overview of our dynamic grouping framework for calibration-free, transformation-free low-bit PTQ.
  • Figure 2: Algorithm comparison on small matrices: quantization loss versus matrix size $n$ for matrices in $\mathbb{R}^{n \times n}$.
  • Figure 3: Algorithm comparison on large matrices: quantization loss versus matrix size $n$ for matrices in $\mathbb{R}^{n \times n}$.
  • Figure 4: Algorithm comparison on small matrices: CPU quantization time versus matrix size $n$ for matrices in $\mathbb{R}^{n \times n}$.
  • Figure 5: Algorithm comparison on large matrices: CPU quantization time versus matrix size $n$ for matrices in $\mathbb{R}^{n \times n}$.
  • ...and 5 more figures