TurboAngle: Near-Lossless KV Cache Compression via Uniform Angle Quantization

Dipkumar Patel

Abstract

We compress KV cache entries by quantizing angles in the Fast Walsh-Hadamard domain, where a random diagonal rotation makes consecutive element pairs approximately uniformly distributed on the unit circle. We extend this angular quantizer with per-layer early-boost, which independently configures K and V codebook sizes at each layer, allocating higher precision to a model-specific subset of critical layers. Across seven models (1B to 7B parameters), per-layer early-boost achieves lossless compression on four models and near-lossless quality on six of seven, at 3.28 to 3.67 angle bits per element. Asymmetric norm quantization (8-bit for keys, 4-bit log-space for values) yields 6.56 total bits per element on Mistral-7B with perplexity degradation of +0.0014 and no calibration data. A layer-group sensitivity analysis reveals model-specific bottleneck patterns, including K-dominated versus V-dominated layers and negative-transfer layers where increased precision degrades quality.

Paper Structure

This paper contains 29 sections, 3 equations, 1 figure, 6 tables, and 1 algorithm.

Figures (1)

  • Figure 1: TurboAngle pipeline. Top: the compression path applies a random diagonal rotation $D$, the normalized FWHT $H$, polar decomposition of consecutive pairs, and uniform angle quantization on $S^1$, storing angle indices $k_i$ and norms $r_i$. Bottom: reconstruction maps $(k_i, r_i)$ back to Cartesian coordinates via trigonometric lookup, then applies the inverse FWHT to recover the approximate KV vector.
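The compression and reconstruction paths in Figure 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a ±1 random diagonal for $D$, a normalized FWHT for $H$ (which is its own inverse), and a uniform codebook of $2^B$ angles on $S^1$; norm storage and the per-layer early-boost configuration are omitted.

```python
import numpy as np

def fwht(x):
    """Normalized Fast Walsh-Hadamard transform (length must be a power of 2).
    Because it is normalized, applying it twice recovers the input (H @ H = I)."""
    n = len(x)
    y = x.astype(np.float64).copy()
    h = 1
    while h < n:
        y = y.reshape(-1, 2, h)
        a, b = y[:, 0, :].copy(), y[:, 1, :].copy()
        y[:, 0, :], y[:, 1, :] = a + b, a - b
        y = y.reshape(n)
        h *= 2
    return y / np.sqrt(n)

def compress(v, signs, bits=4):
    """Rotate by the diagonal (signs), apply FWHT, then split the result into
    consecutive pairs and store each pair as (angle index, norm)."""
    t = fwht(signs * v)
    pairs = t.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)              # per-pair norms r_i
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])   # angle in [-pi, pi]
    levels = 2 ** bits
    k = np.round((theta + np.pi) / (2 * np.pi) * levels).astype(int) % levels
    return k, r                                    # angle indices k_i, norms r_i

def decompress(k, r, signs, bits=4):
    """Map (k_i, r_i) back to Cartesian pairs via cos/sin lookup, then invert
    the FWHT and the diagonal rotation."""
    levels = 2 ** bits
    theta = k * (2 * np.pi / levels) - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return signs * fwht(pairs.reshape(-1))
```

At the paper's roughly 3.3-3.7 angle bits per element, each reconstructed angle is off by at most half a quantization step, so the reconstruction error per pair is bounded by the norm times that angular error; raising `bits` shrinks it geometrically.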