KVLinC: KV Cache Quantization with Hadamard Rotation and Linear Correction

Utkarsh Saxena, Kaushik Roy

TL;DR

The paper tackles the memory and bandwidth challenges of KV-cache inference in large language models under extreme low-precision quantization. It introduces KVLinC, a framework that combines Hadamard rotation-based quantization (optimizing channel-wise vs token-wise axes) with trainable linear correction adapters to explicitly compensate for quantization-induced attention distortions. Empirical results across Llama and Qwen model families show that KVLinC matches or surpasses strong baselines at 2-bit precision, with notable gains on smaller models and instruction-tuned tasks, while a custom Triton kernel delivers substantial end-to-end speedups and enables larger batch sizes. This work demonstrates practical, scalable long-context inference with aggressive KV-cache compression, reducing memory footprint and latency without sacrificing accuracy.

Abstract

Quantizing the key-value (KV) cache is a promising strategy for improving the inference efficiency of large language models (LLMs). However, aggressive quantization to very low precision (e.g., 2 bits) introduces significant errors in the stored key and value tensors, which propagate through the dot-product attention mechanism and ultimately degrade generation quality. To address this, we propose KVLinC, a framework to mitigate attention errors introduced by KV cache quantization in the extreme low-precision regime. KVLinC combines a Hadamard rotation, which reduces quantization error in values, with lightweight linear correction adapters that explicitly compensate for errors introduced by quantized keys. Across extensive evaluations on the LLaMA, Qwen2.5, and Qwen3 model families, KVLinC consistently matches or surpasses strong baselines while achieving higher KV-cache compression. Furthermore, we implement a custom attention kernel that delivers up to 2.55x faster inference than a Flash Attention baseline, enabling efficient long-context LLM inference.

Paper Structure

This paper contains 17 sections, 7 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Token-wise and channel-wise quantization grouping.
  • Figure 2: Distribution of key and values with and without Hadamard rotation for Qwen-2.5-3B layer 16 head 0.
  • Figure 3: Wikitext perplexity under different 2-bit quantization configurations for keys and values. Perplexity values are clipped to $500$. Quantizing raw keys channel-wise and quantizing Hadamard-rotated values token-wise achieves the best performance (shown in red).
  • Figure 3: Performance when applying Hadamard rotation and linear correction in isolation on the Llama family. $\uparrow$: higher is better, $\downarrow$: lower is better.
  • Figure 4: Layer-wise scaling factor for different quantization configurations of keys.
  • ...and 3 more figures
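
The configuration singled out in the Figure 3 caption (raw keys quantized channel-wise, Hadamard-rotated values quantized token-wise) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names `hadamard` and `fake_quant`, the random test tensors, the tensor shapes, and the choice of one scale and zero point per whole token or channel (rather than per fixed-size group) are all assumptions made for illustration, and the linear correction adapters are not modeled.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction (n a power of 2)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(x, axis, bits=2):
    """Asymmetric uniform quantize-dequantize.

    The min/max statistics are reduced over `axis`, so every slice along
    that axis shares one (scale, zero point) pair. Returns the dequantized
    tensor and the per-slice scale.
    """
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo, scale

# Hypothetical single-head cache: rows are tokens, columns are channels.
rng = np.random.default_rng(0)
tokens, head_dim = 128, 64
K = rng.normal(size=(tokens, head_dim))
V = rng.normal(size=(tokens, head_dim))
H = hadamard(head_dim)

# Keys: channel-wise, i.e. statistics shared across tokens within a channel.
K_hat, _ = fake_quant(K, axis=0)

# Values: rotate along the channel dimension, then quantize token-wise
# (statistics shared across channels within a token).
V_rot = V @ H
V_rot_hat, v_scale = fake_quant(V_rot, axis=1)
V_hat = V_rot_hat @ H.T  # H is orthonormal, so H.T undoes the rotation

k_mse = np.mean((K - K_hat) ** 2)
v_mse = np.mean((V - V_hat) ** 2)
print(f"key MSE: {k_mse:.4f}, value MSE: {v_mse:.4f}")
```

Because the rotation is orthonormal, the reconstruction error measured after undoing it equals the quantization error in the rotated space, so swapping the `axis` arguments or dropping the rotation gives a like-for-like comparison of the configurations. Random Gaussian inputs do not reproduce the outlier structure of real key and value tensors, so the printed errors say nothing about the paper's results.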