Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment

Deokjae Lee, Hyun Oh Song

TL;DR

This work tackles weight-only post-training quantization (PTQ) for large language models in memory-constrained environments by Gaussianizing weight distributions and deriving the information-theoretically optimal fractional-bit allocation under a fixed memory budget. It introduces Q-Palette, a versatile suite of fractional-bit quantizers (NUQ, VQ, TCQ, Half-TCQ) with optimized CUDA kernels, and integrates them into a mixed-scheme quantization (MSQ) framework that can be tuned under memory or latency constraints. A fusion-aware extension of MSQ jointly optimizes quantizer choices and layer fusion, yielding improved accuracy-latency trade-offs. Empirical results on LLaMA and Qwen models show consistent gains over data-free and data-aware baselines, supporting both the theoretical foundation and the practical utility for edge-device deployment of LLMs.

Abstract

We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.
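
The abstract refers to the Gaussian distortion-rate bound, which for a unit-variance Gaussian source at rate $R$ bits per weight is $D(R) = 2^{-2R}$. The sketch below is an illustration only, not code from the Q-Palette repository: it compares that bound with the empirical error of a plain uniform scalar quantizer. The function names, the sample size, and the clipping-range grid are all assumptions made for this example.

```python
import numpy as np


def gaussian_dr_bound(bits: float) -> float:
    """Distortion-rate bound D(R) = 2^(-2R) for a unit-variance Gaussian source."""
    return 2.0 ** (-2.0 * bits)


def uniform_quantizer_mse(x: np.ndarray, bits: int, clip: float) -> float:
    """Empirical MSE of a uniform quantizer with 2^bits cells on [-clip, clip].

    Each sample is mapped to the midpoint of its cell; samples outside the
    range are assigned to the outermost cells.
    """
    levels = 2 ** bits
    step = 2.0 * clip / levels
    idx = np.clip(np.floor((x + clip) / step), 0, levels - 1)
    x_hat = -clip + (idx + 0.5) * step
    return float(np.mean((x - x_hat) ** 2))


def best_uniform_mse(x: np.ndarray, bits: int) -> float:
    """Smallest MSE over a coarse grid of clipping ranges (assumed grid)."""
    clips = np.linspace(0.5, 6.0, 56)
    return min(uniform_quantizer_mse(x, bits, c) for c in clips)


rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)

for bits in (2, 3, 4):
    mse = best_uniform_mse(samples, bits)
    bound = gaussian_dr_bound(bits)
    print(f"{bits} bits: uniform MSE = {mse:.5f}, bound = {bound:.5f}")
```

The gap between the two columns is the headroom that the better quantizers described in the abstract (vector and trellis-coded) aim to close.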

Paper Structure

This paper contains 71 sections, 2 theorems, 26 equations, 5 figures, 10 tables.

Key Result

Theorem 3.1

If the budget $M$ is feasible, i.e., $M \ge \eta \sum_{l=1}^L d_l^{\mathrm{in}} d_l^{\mathrm{out}}$, then the optimal fractional bit allocation $\{b_l^*\}$ for problem (eq:mpq_frac) is given by [equation missing in source] for the constant $C$ that satisfies the memory constraint $\sum_l b_l^* d_l^{\mathrm{in}} d_l^{\mathrm{out}} = M$.
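
To make the shape of such an allocation concrete, here is a minimal sketch of a water-filling style solver. It is not the paper's stated formula. It assumes a standard high-resolution error model in which layer $l$ contributes $a_l \cdot 2^{-2 b_l}$ to the total error, with hypothetical sensitivity coefficients $a_l$, layer sizes $n_l = d_l^{\mathrm{in}} d_l^{\mathrm{out}}$, and a minimum bitwidth $\eta$. Under those assumptions the stationarity condition gives $b_l = \max(\eta, \tfrac{1}{2}\log_2(a_l / (C n_l)))$, and the constant $C$ is found by bisection so that the memory constraint is met. The function name and all inputs are illustrative.

```python
import math


def allocate_bits(sens, sizes, budget, min_bits):
    """Fractional bit allocation under an ASSUMED error model.

    Minimizes sum_l sens[l] * 2^(-2 * b_l)
    subject to sum_l sizes[l] * b_l = budget and b_l >= min_bits.
    sens (a_l) and this objective are illustrative assumptions.
    """
    if budget < min_bits * sum(sizes):
        raise ValueError("budget is infeasible for the given minimum bitwidth")

    def bits_for(c):
        # Stationarity condition of the assumed objective, clamped at min_bits.
        return [max(min_bits, 0.5 * math.log2(a / (c * n)))
                for a, n in zip(sens, sizes)]

    def used(c):
        # Total memory (in bits) consumed by the allocation for constant c.
        return sum(n * b for n, b in zip(sizes, bits_for(c)))

    # used(c) is non-increasing in c, so bisect on c in log space.
    lo, hi = 1e-30, 1e30
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if used(mid) > budget:
            lo = mid  # allocation too large: increase c
        else:
            hi = mid  # allocation fits: try a smaller c
    return bits_for(hi)


# Three equally sized layers, an average of 3 bits per weight, minimum 2 bits.
bits = allocate_bits(sens=[4.0, 2.0, 1.0], sizes=[100, 100, 100],
                     budget=900, min_bits=2.0)
print(bits)
```

In this toy setting more sensitive layers receive more bits, and the resulting bitwidths are generally fractional, which is why a palette of fractional-bit quantizers is needed to realize such an allocation in practice.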

Figures (5)

  • Figure 1: Qualitative comparison of quantization frameworks based on Q-Palette against the NormalFloat baseline with FLUTE kernels [normalfloatflute], evaluated on the LLaMA 3.1-8B model using an RTX 4090 GPU with a batch size of 1. Compared to (a) NormalFloat, (b) single-scheme quantization with TCQ-3.25 achieves a 17% inference speedup, (c) MSQ with Q-Palette provides a 28% speedup, and (d) fusion-aware MSQ further yields a 36% speedup alongside reduced WikiText2 perplexity, highlighting the practical effectiveness of Q-Palette and our MSQ framework. In the MSQ visualizations, columns represent transformer blocks, and rows represent linear layers, with colors indicating selected quantization bitwidths. The right visualization illustrates fused layers within the 30th transformer block from configuration (d). Refer to the appendix (app:fig1_settings) for the experimental details.
  • Figure 2: Gaussian quantization error of Q-Palette quantizers (NUQ, VQ, TCQ) compared to the uniform baseline.
  • Figure 3: Performance comparison of memory-constrained MSQ for different quantizer sets in Q-Palette on LLaMA 3.1-8B.
  • Figure 4: Performance trade-offs of quantized LLaMA 3.1-8B models under different constraints in the data-free setting on an RTX 4090 GPU: (a) memory constraint; (b) latency constraint (single batch); (c) throughput evaluation (batch size $= 8$) of the quantized models in (b).
  • Figure 5: Mixed-scheme quantization results on Qwen 2.5-7B and LLaMA 3.1-70B models. To accommodate the broader sensitivity range in LLaMA 3.1-70B, we extended the quantizer set to include higher-bitwidth options (NUQ 7/8 bits and VQ 5.5/6 bits), in addition to the TCQ quantizers.

Theorems & Definitions (4)

  • Theorem 3.1: Optimal bit allocation with ideal Gaussian quantizers
  • proof
  • Theorem A.1: Optimal bit allocation with ideal Gaussian quantizers
  • proof