KurTail: Kurtosis-based LLM Quantization

Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, Martino Dazzi

TL;DR

This paper tackles the outlier problem in activation quantization for large language models by introducing KurTail, a kurtosis-based rotation method that learns orthogonal transformations per layer to flatten activation distributions toward a uniform shape. The approach enables robust 4-bit quantization of weights, activations, and KV caches with a layer-wise, memory-efficient pipeline, requiring only a single consumer GPU in many cases. Empirically, KurTail improves MMLU accuracy and Wiki perplexity relative to QuaRot and SpinQuant, including on mixture-of-experts models, while reducing training cost. The work highlights the practical impact of learnable, kurtosis-driven rotations for efficient LLM deployment, and points to future work on static quantization to push efficiency further.

Abstract

One of the challenges of quantizing a large language model (LLM) is the presence of outliers. Outliers often make uniform quantization schemes less effective, particularly in extreme cases such as 4-bit quantization. We introduce KurTail, a new post-training quantization (PTQ) scheme that leverages kurtosis-based rotation to mitigate outliers in the activations of LLMs. Our method optimizes kurtosis as a measure of tailedness. This approach enables the quantization of weights, activations, and the KV cache in 4 bits. We utilize layer-wise optimization, ensuring memory efficiency. KurTail outperforms existing quantization methods, offering a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant with a 2.6% MMLU gain and reduces perplexity by 2.9%, all while reducing the training cost. For comparison, learning the rotation using SpinQuant for Llama3-70B requires at least four NVIDIA H100 80GB GPUs, whereas our method requires only a single GPU, making it a more accessible solution for consumer GPUs.
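
To make the abstract's idea concrete, here is a minimal sketch; it is not the authors' implementation. It uses a fixed normalized Hadamard transform as a stand-in orthogonal rotation (KurTail learns its rotations by optimizing kurtosis, and that optimization is not shown). The helper names `kurtosis`, `hadamard_rotate`, `quantize_4bit`, and `mse`, the synthetic token with one outlier channel, and the symmetric per-token quantizer are all assumptions made for illustration.

```python
import math
import random


def kurtosis(xs):
    """Non-excess kurtosis: E[(x - mu)^4] / (E[(x - mu)^2])^2."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / (m2 ** 2)


def hadamard_rotate(xs):
    """Normalized Walsh-Hadamard transform, an orthogonal map.

    Stand-in for a learned rotation; len(xs) must be a power of two.
    """
    out = list(xs)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_4bit(xs):
    """Symmetric per-token 4-bit uniform quantization, then dequantization."""
    max_abs = max(abs(x) for x in xs)
    step = max_abs / 7.0  # signed integer levels in [-7, 7]
    return [max(-7, min(7, round(x / step))) * step for x in xs]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
token = [random.gauss(0.0, 1.0) for _ in range(256)]
token[3] = 60.0  # hypothetical outlier channel

rotated = hadamard_rotate(token)
err_plain = mse(token, quantize_4bit(token))
err_rotated = mse(rotated, quantize_4bit(rotated))

print("kurtosis before/after:", kurtosis(token), kurtosis(rotated))
print("4-bit MSE before/after:", err_plain, err_rotated)
```

On this synthetic token the rotation spreads the single outlier across all channels, so both the kurtosis and the 4-bit reconstruction error come out lower after rotation. Because the rotation is orthogonal, the error measured in the rotated space equals the error after rotating back.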

Paper Structure

This paper contains 23 sections, 1 theorem, 4 equations, 3 figures, 10 tables.

Key Result

Theorem 2.2

Let ${\bm{x}}_U$ and ${\bm{x}}_N$ be continuous random variables with uniform and normal distributions, respectively. Then, for any given $\varepsilon > 0$, the quantization sensitivity $\Gamma({\bm{x}}, \varepsilon)$ satisfies $\Gamma({\bm{x}}_U, \varepsilon) < \Gamma({\bm{x}}_N, \varepsilon)$ …
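
As background for the comparison in Theorem 2.2, the two distributions differ in kurtosis, the tailedness measure that the abstract says KurTail optimizes. The closed forms below are standard results under the non-excess definition of kurtosis; the symbols $\kappa$, $a$, $\mu$, and $\sigma$ are notation introduced here for illustration and are not taken from the paper.

$$
\begin{aligned}
\kappa({\bm{x}}) &= \frac{\mathbb{E}\left[({\bm{x}} - \mu)^4\right]}{\sigma^4}, \\
{\bm{x}}_U \sim \mathcal{U}(-a, a):\quad \sigma^2 &= \frac{a^2}{3},\quad \mathbb{E}\left[{\bm{x}}_U^4\right] = \frac{a^4}{5}
\;\Rightarrow\; \kappa({\bm{x}}_U) = \frac{a^4/5}{a^4/9} = \frac{9}{5} = 1.8, \\
{\bm{x}}_N \sim \mathcal{N}(0, \sigma^2):\quad \mathbb{E}\left[{\bm{x}}_N^4\right] &= 3\sigma^4
\;\Rightarrow\; \kappa({\bm{x}}_N) = 3.
\end{aligned}
$$

Under this definition a uniform variable has lower kurtosis than a normal one, which is in line with the theorem ranking the uniform distribution as less sensitive to quantization.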

Figures (3)

  • Figure 1: Empirical sensitivity of the MHSA input distribution across different rotations. $\alpha$ indicates the fraction of the optimal step size used to analyze the sensitivity of quantization.
  • Figure 2: The input distribution of the MHSA and FFN blocks in the LLaMA3-8B model is shown before and after applying KurTail. Before rotation, some channels have noticeable outliers, which can disrupt the data balance. The rotated distribution allows for more accurate token-wise quantization.
  • Figure 3: Diagram of a single-layer decoder network after applying rotations. Each block represents a computation unit. Blocks containing both blue and black indicate that the rotation is fused into the network without adding extra computation. In contrast, blocks with only the rotation signify additional computations during inference.
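
Figure 3 distinguishes rotations that are fused into the network from those that add computation at inference. The sketch below illustrates the general identity that makes fusion possible, $(xQ)(Q^\top W) = xW$ for an orthogonal $Q$. It is an assumption that this is the mechanism the figure depicts; the matrices `Q`, `W`, the input `x`, and the helper names are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Transpose a list-of-rows matrix."""
    return [list(row) for row in zip(*m)]


# Hypothetical orthogonal rotation: a normalized 4x4 Hadamard matrix.
Q = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]

# Hypothetical activation row (one token) and weight matrix (4 inputs, 3 outputs).
x = [[1.0, -2.0, 0.5, 8.0]]
W = [
    [0.2, -0.1, 0.4],
    [0.0, 0.3, -0.5],
    [1.0, 0.7, 0.1],
    [-0.6, 0.2, 0.9],
]

# Reference output of the unrotated layer.
y_ref = matmul(x, W)

# Offline: fold Q^T into the weights once, so no extra work is needed at inference.
W_fused = matmul(transpose(Q), W)

# The rotated activation is what a quantizer would see instead of x.
x_rot = matmul(x, Q)
y_rot = matmul(x_rot, W_fused)

max_diff = max(abs(a - b) for a, b in zip(y_ref[0], y_rot[0]))
print("max difference between fused and reference outputs:", max_diff)
```

The output is unchanged up to floating-point error because $Q Q^\top = I$. A rotation that cannot be folded into a neighboring weight matrix this way has to be applied explicitly at inference, which is the extra computation the caption refers to.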

Theorems & Definitions (2)

  • Definition 2.1
  • Theorem 2.2