Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment

Deokjae Lee; Hyun Oh Song

TL;DR

This work tackles weight-only post-training quantization (PTQ) for large language models in memory-constrained environments by Gaussianizing weight distributions and deriving an information-theoretic optimal fractional-bit allocation under a fixed memory budget. It introduces Q-Palette, a versatile suite of fractional-bit quantizers (NUQ, VQ, TCQ, Half-TCQ) with optimized CUDA kernels, and integrates them into a mixed-scheme quantization (MSQ) framework that can be tuned under memory and latency constraints. A fusion-aware extension of MSQ jointly optimizes quantizer choices and layer fusion, yielding improved accuracy-latency trade-offs. Empirical results on LLaMA and Qwen models show consistent gains over data-free and data-aware baselines, validating both the theoretical foundation and practical utility for edge-device deployment of LLMs.

Abstract

We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. However, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization. This has recently motivated rotation-based methods that transform weights into near-Gaussian distributions, which are more regular and have fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.
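
To illustrate why an optimal allocation tends to call for fractional bitwidths, here is a minimal sketch and not the paper's actual derivation. It assumes each layer's loss contribution follows the Gaussian distortion-rate form a_l * 2^(-2 b_l) and that the budget is a fixed average number of bits per weight. The function name `allocate_bits` and the inputs `sens` (per-layer sensitivity a_l) and `sizes` (per-layer weight count n_l) are hypothetical, introduced only for this example.

```python
import math

def allocate_bits(sens, sizes, avg_bits):
    """Minimize sum_l sens[l] * 2**(-2 * b_l)
    subject to sum_l sizes[l] * b_l == avg_bits * sum(sizes).

    Setting the Lagrangian derivative to zero gives
    b_l = 0.5 * log2(sens[l] / sizes[l]) + c, with c fixed by the budget.
    Assumption: the budget is large enough that every b_l comes out
    non-negative (no clamping is performed here).
    """
    total = sum(sizes)
    # Layer-dependent part of the solution.
    log_terms = [0.5 * math.log2(a / n) for a, n in zip(sens, sizes)]
    # Shared offset chosen so the size-weighted average equals avg_bits.
    c = avg_bits - sum(n * t for n, t in zip(sizes, log_terms)) / total
    return [t + c for t in log_terms]

# Two equally sized layers, one 4x more sensitive, 3 bits per weight on average.
bits = allocate_bits(sens=[4.0, 1.0], sizes=[1, 1], avg_bits=3.0)
print(bits)  # [3.5, 2.5]
```

Even in this toy case the allocation lands on 3.5 and 2.5 bits, values that integer-only quantizers cannot realize without giving up either budget or accuracy.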
