KurTail: Kurtosis-based LLM Quantization

Mohammad Sadegh Akhondzadeh; Aleksandar Bojchevski; Evangelos Eleftheriou; Martino Dazzi

TL;DR

This paper tackles the outlier problem in activation quantization for large language models by introducing KurTail, a kurtosis-based rotation method that learns orthogonal transformations per layer to flatten activation distributions toward a uniform shape. The approach enables robust 4-bit quantization of weights, activations, and the KV cache with a layer-wise, memory-efficient pipeline that in many cases requires only a single consumer GPU. Empirically, KurTail improves MMLU accuracy and lowers Wiki perplexity compared to QuaRot and SpinQuant, including on mixture-of-experts models, while reducing training cost. The work highlights the practical value of learnable, kurtosis-driven rotations for efficient LLM deployment and points to static quantization as future work to push efficiency further.
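For reference, the standard (Pearson) definition of kurtosis is given below. This is the textbook quantity, not necessarily the exact objective the authors optimize, which this summary does not spell out; the symbols $x$, $\mu$, and $\kappa$ are notation assumed here.

```latex
\kappa(x) \;=\; \frac{\mathbb{E}\!\left[(x-\mu)^{4}\right]}{\left(\mathbb{E}\!\left[(x-\mu)^{2}\right]\right)^{2}},
\qquad \mu = \mathbb{E}[x],
\qquad \kappa_{\text{Gaussian}} = 3,
\qquad \kappa_{\text{uniform}} = \tfrac{9}{5}.
```

A heavy-tailed distribution with outliers has a kurtosis well above 3, so pushing kurtosis down toward 9/5 corresponds to the "uniform shape" mentioned above, which is the shape a uniform quantization grid covers most evenly.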

Abstract

One of the challenges of quantizing a large language model (LLM) is the presence of outliers. Outliers often make uniform quantization schemes less effective, particularly in extreme cases such as 4-bit quantization. We introduce KurTail, a new post-training quantization (PTQ) scheme that leverages kurtosis-based rotation to mitigate outliers in the activations of LLMs. Our method optimizes kurtosis as a measure of tailedness. This approach enables the quantization of weights, activations, and the KV cache in 4 bits. We utilize layer-wise optimization, ensuring memory efficiency. KurTail outperforms existing quantization methods, offering a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant, with a 2.6% MMLU gain and a 2.9% reduction in perplexity, all while reducing the training cost. For comparison, learning the rotation using SpinQuant for Llama3-70B requires at least four NVIDIA H100 80GB GPUs, whereas our method requires only a single GPU, making it a more accessible solution for consumer GPUs.
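The link between outliers, kurtosis, and 4-bit quantization error can be illustrated with a minimal NumPy sketch. It is not the KurTail method: the activations are synthetic, the rotation is a random orthogonal matrix rather than one learned by minimizing kurtosis, and the helper names (`kurtosis`, `quantize_4bit`, `relative_error`) are hypothetical.

```python
import numpy as np

def kurtosis(x):
    """Pearson kurtosis E[(x - mean)^4] / var^2 over all entries."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    return float(np.mean(centered**4) / np.mean(centered**2) ** 2)

def quantize_4bit(x):
    """Symmetric per-row 4-bit uniform quantization followed by dequantization."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

def relative_error(x):
    """Frobenius-norm quantization error relative to the signal norm."""
    return float(np.linalg.norm(x - quantize_4bit(x)) / np.linalg.norm(x))

rng = np.random.default_rng(0)
tokens, dim = 512, 256

# Assumption: synthetic activations with four high-magnitude outlier channels.
acts = rng.standard_normal((tokens, dim))
acts[:, :4] *= 20.0

# Assumption: a random orthogonal matrix (via QR) stands in for a learned rotation.
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
rotated = acts @ rotation

kurt_before, kurt_after = kurtosis(acts), kurtosis(rotated)
err_before, err_after = relative_error(acts), relative_error(rotated)

print(f"kurtosis:      {kurt_before:.1f} -> {kurt_after:.1f}")
print(f"4-bit rel err: {err_before:.3f} -> {err_after:.3f}")
```

Because the rotation is orthogonal, it can be undone exactly by its transpose, so it changes the shape of the distribution without losing information. A random rotation only moves the distribution toward a Gaussian shape; a rotation trained to reduce kurtosis, as the abstract describes, aims to flatten it further.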
