PolarQuant: Quantizing KV Caches with Polar Transformation

Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, Amir Zandieh

TL;DR

PolarQuant tackles the memory bottleneck of KV caches in long-context autoregressive transformers by quantizing KV embeddings in polar coordinates after a random preconditioning step. The method uses a recursive polar transformation to produce angle coordinates, derives exact distributions for the polar angles under preconditioning, and proves an error bound showing that the reconstruction error scales as $\mathbb{E}[\|\mathbf{x}-\mathbf{x}'\|_2^2] = \varepsilon \|\mathbf{x}\|_2^2$ with $O(\log(1/\varepsilon))$ bits per coordinate. Applied to KV cache quantization, PolarQuant compresses memory by more than $4.2\times$ while achieving the best quality scores among the compared state-of-the-art methods on long-context benchmarks such as LongBench. The work removes the normalization overhead that limits many prior quantization methods and offers potential extensions to weight quantization and vector similarity search in large-scale models.
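The pipeline summarized above (recursive polar transformation followed by angle quantization) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `to_polar`, `from_polar`, `quantize_angle`, `quantize_levels`, and `relative_error` are hypothetical, pairing adjacent coordinates at each level is an assumed reading of the recursive transformation, a uniform quantizer with the same bit width at every level is assumed, and a standard Gaussian vector stands in for a preconditioned embedding.

```python
import math
import random

def to_polar(x):
    """Return (radius, levels); levels[k] holds the angles produced at level k."""
    d = len(x)
    assert d >= 2 and d & (d - 1) == 0, "length must be a power of two"
    levels, r = [], list(x)
    while len(r) > 1:
        # Level 0 pairs signed coordinates, so atan2 covers the full circle.
        # Later levels pair non-negative radii, so angles fall in [0, pi/2].
        angles = [math.atan2(r[i + 1], r[i]) for i in range(0, len(r), 2)]
        r = [math.hypot(r[i], r[i + 1]) for i in range(0, len(r), 2)]
        levels.append(angles)
    return r[0], levels

def from_polar(radius, levels):
    """Invert to_polar by expanding each radius into a (cos, sin) pair."""
    r = [radius]
    for angles in reversed(levels):
        nxt = []
        for rad, a in zip(r, angles):
            nxt += [rad * math.cos(a), rad * math.sin(a)]
        r = nxt
    return r

def quantize_angle(a, lo, hi, bits):
    """Assumed uniform quantizer: snap a to the centre of one of 2**bits bins."""
    n = 1 << bits
    step = (hi - lo) / n
    idx = min(n - 1, max(0, int((a - lo) / step)))
    return lo + (idx + 0.5) * step

def quantize_levels(levels, bits):
    out = []
    for k, angles in enumerate(levels):
        lo, hi = (-math.pi, math.pi) if k == 0 else (0.0, math.pi / 2)
        out.append([quantize_angle(a, lo, hi, bits) for a in angles])
    return out

def relative_error(x, bits):
    """Squared reconstruction error divided by the squared norm of x."""
    radius, levels = to_polar(x)
    x_hat = from_polar(radius, quantize_levels(levels, bits))
    num = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return num / sum(a * a for a in x)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]  # stand-in for a preconditioned vector
for bits in (2, 4, 6):
    print(bits, relative_error(x, bits))
```

In this sketch the only quantities kept per vector are one radius and the quantized angles; the quantizer ranges are fixed in advance, so no per-block scale or zero point is computed from the data.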

Abstract

Large language models (LLMs) require significant memory to store Key-Value (KV) embeddings in their KV cache, especially when handling long-range contexts. Quantization of these KV embeddings is a common technique to reduce memory consumption. This work introduces PolarQuant, a novel quantization method employing random preconditioning and polar transformation. Our method transforms the KV embeddings into polar coordinates using an efficient recursive algorithm and then quantizes the resulting angles. Our key insight is that, after random preconditioning, the angles in the polar representation exhibit a tightly bounded and highly concentrated distribution with an analytically computable form. This well-behaved distribution eliminates the need for explicit normalization, a step required by traditional quantization methods that introduces significant memory overhead because quantization parameters (e.g., zero point and scale) must be stored in full precision for each data block. PolarQuant bypasses this normalization step, enabling substantial memory savings. The long-context evaluation demonstrates that PolarQuant compresses the KV cache by over $4.2\times$ while achieving the best quality scores compared to the state-of-the-art methods.
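The normalization overhead mentioned in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper names `effective_bits` and `compression_ratio` are hypothetical, and the block size of 32, the 16-bit parameter width, and the 16-bit baseline are assumptions chosen for the example, not settings reported in the paper.

```python
def effective_bits(bits, block_size, param_bits=16, params_per_block=2):
    """Bits per value once the per-block scale and zero point are counted."""
    return bits + params_per_block * param_bits / block_size

def compression_ratio(bits_per_value, baseline_bits=16):
    """Compression relative to an assumed 16-bit baseline."""
    return baseline_bits / bits_per_value

# Assumed example: 4-bit codes with one scale and one zero point per 32 values.
with_params = effective_bits(bits=4, block_size=32)
print(with_params, compression_ratio(with_params))

# The same codes if no per-block parameters had to be stored.
print(4, compression_ratio(4))
```

Under these assumed numbers the stored parameters add one extra bit per value, which is the kind of overhead a method without per-block normalization avoids.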

Paper Structure

This paper contains 20 sections, 5 theorems, 37 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

For any positive integer $d$, if $x, y \ge 0$ are two i.i.d. random variables following the generalized gamma distribution with probability density function $f_Z(z) = \frac{2}{2^{d/2} \cdot \Gamma(d/2)} z^{d-1} \exp\left( -z^2/2 \right)$, then the angle variable $\Theta := \tan^{-1}(y / x)$ follows the probability density function $f_\Theta(\theta) = \frac{\Gamma(d)}{2^{d-2} \cdot \Gamma(d/2)^2} \sin^{d-1}(2\theta)$ for $\theta \in [0, \pi/2]$. Additionally, $\mathbb{E}[\Theta] = \pi/4$ and $\mathrm{Var}(\Theta) = O(1/\sqrt{d})$.
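The mean and variance claims of Lemma 1 can be checked numerically. The density $f_Z$ is that of the Euclidean norm of a $d$-dimensional standard Gaussian vector (the chi distribution with $d$ degrees of freedom), so the sketch below draws $x$ and $y$ that way. The function names `sample_angle` and `angle_stats`, the sample size, and the seed are assumptions made for this illustration.

```python
import math
import random

def sample_angle(d, rng):
    """Draw x, y as norms of d-dimensional standard Gaussians; return atan(y / x)."""
    x = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
    y = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
    return math.atan2(y, x)  # equals atan(y / x) because x, y >= 0

def angle_stats(d, n=5000, seed=0):
    """Empirical mean and variance of the angle over n samples."""
    rng = random.Random(seed)
    samples = [sample_angle(d, rng) for _ in range(n)]
    mean = sum(samples) / n
    var = sum((t - mean) ** 2 for t in samples) / n
    return mean, var

for d in (4, 16, 64):
    mean, var = angle_stats(d)
    print(d, mean, var, "target mean:", math.pi / 4)
```

The empirical mean should sit close to $\pi/4$ for every $d$, and the variance should shrink as $d$ grows, which is the concentration that lets the angles be quantized on a fixed grid.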

Figures (3)

  • Figure 1: Overview of the recursive polar transformation procedure in Definition 1.
  • Figure 2: Distributions of the angles of polar-transformed key embeddings (a) with and (b) without random preconditioning. Preconditioning flattens the angle distribution and removes outliers, which allows the angles to be quantized more accurately.
  • Figure 3: Needle-In-A-Haystack test using Llama-3.1-8B-Instruct. The test spans different depths and context lengths ranging from 4K to 104K. Green/red colors indicate high/low recall scores (higher is better). PolarQuant shows the best performance.

Theorems & Definitions (10)

  • Lemma 1
  • Definition 1: Cartesian to Polar Transformation
  • Lemma 2: Distribution of a Gaussian Vector Under Polar Transformation
  • proof
  • Theorem 1
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof