Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng, Shuaiwen Leon Song

TL;DR

Kitty addresses the KV cache memory bottleneck in long-context LLM inference by enabling aggressive 2-bit quantization with minimal accuracy loss. It introduces Dynamic Channel-wise Precision Boost, which keeps a small subset of Key channels at higher precision (INT4) while quantizing the rest to INT2, and it preserves the initial tokens in full precision, forming a practical mixed-precision approach. The system uses a page-centric KV layout with a Dense–Sparse decomposition and Triton-based dequantization kernels to maintain memory coalescing and low divergence, enabling end-to-end inference. Across Qwen3 and LLaMA3 models and long-context benchmarks, Kitty achieves up to $8\times$ memory reduction and throughput improvements of $2.1\times$–$4.1\times$ under the same memory budget, with accuracy approaching FP16; the authors also release the implementation.
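The mixed-precision idea can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The sensitivity proxy (mean absolute magnitude per channel), the min-max asymmetric quantizer, the 12.5% boost fraction, and all function names are assumptions made for illustration only.

```python
import random

def quantize_channel(values, bits):
    """Min-max asymmetric quantization of one channel, returned dequantized."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]

def boost_quantize(keys, boost_fraction):
    """keys: one quantization group, a list of token rows with D channels each.
    Returns (dequantized keys, sorted indices of the boosted channels)."""
    num_tokens, num_channels = len(keys), len(keys[0])
    columns = [[row[c] for row in keys] for c in range(num_channels)]
    # Assumed sensitivity proxy: mean absolute magnitude of each channel.
    sensitivity = [sum(abs(v) for v in col) / num_tokens for col in columns]
    num_boosted = int(num_channels * boost_fraction)
    ranked = sorted(range(num_channels), key=lambda c: sensitivity[c], reverse=True)
    boosted = sorted(ranked[:num_boosted])
    # Boosted channels get 4 bits, all other channels get 2 bits.
    deq_cols = [quantize_channel(col, 4 if c in boosted else 2)
                for c, col in enumerate(columns)]
    deq = [[deq_cols[c][t] for c in range(num_channels)] for t in range(num_tokens)]
    return deq, boosted

def mse(a, b):
    """Mean squared error between two equally shaped nested lists."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

# Synthetic group: 128 tokens, 16 channels, two high-magnitude channels.
rng = random.Random(0)
G, D = 128, 16
outliers = {3, 11}
keys = [[rng.gauss(0, 1) * (10.0 if c in outliers else 1.0) for c in range(D)]
        for _ in range(G)]

mixed, boosted = boost_quantize(keys, boost_fraction=0.125)
int2_only, _ = boost_quantize(keys, boost_fraction=0.0)
mse_mixed, mse_int2 = mse(keys, mixed), mse(keys, int2_only)
print(boosted, mse_mixed, mse_int2)
```

On this synthetic data the ranking picks the two high-magnitude channels, and boosting them lowers the reconstruction error relative to uniform INT2. The paper's actual sensitivity metric may differ from the proxy used here.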

Abstract

The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit often degrades it, especially on long-context reasoning. We close this gap via an algorithm-system co-design for mixed-precision KV caching: Kitty. On the algorithm side, extensive experiments show that Dynamic Channel-wise Precision Boost (which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision) maintains near-zero accuracy loss while approaching 2-bit memory. The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. Kitty addresses these issues by decomposing each mixed-precision Key page into two tensors with unified 2-bit precision. Based on this, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. Across seven tasks and two model families (Qwen3, LLaMA3), Kitty cuts KV memory by nearly 8x with negligible accuracy loss, enabling up to 8x larger batches and 2.1x-4.1x higher throughput under the same memory budget. We release the full implementation of Kitty at https://github.com/Summer-Summer/Kitty.
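The abstract does not spell out how a mixed-precision page becomes two uniform 2-bit tensors. One plausible reading, sketched below purely as an assumption, is to split every 4-bit code into a high and a low 2-bit half. A dense tensor then holds one 2-bit code per channel, and a sparse tensor holds the extra low bits of the boosted channels only. The names `split_page`, `merge_page`, and `pack2` are hypothetical and do not come from the Kitty code base.

```python
def split_page(codes, boosted):
    """codes[t][c] is an integer code: 0..15 for boosted channels, 0..3 otherwise.
    Returns (dense, sparse), both containing only 2-bit values."""
    dense = [[(v >> 2) if c in boosted else v for c, v in enumerate(row)]
             for row in codes]
    sparse = [[row[c] & 0b11 for c in boosted] for row in codes]
    return dense, sparse

def merge_page(dense, sparse, boosted):
    """Inverse of split_page: rebuild the 4-bit codes of the boosted channels."""
    merged = [list(row) for row in dense]
    for t, extra in enumerate(sparse):
        for j, c in enumerate(boosted):
            merged[t][c] = (dense[t][c] << 2) | extra[j]
    return merged

def pack2(values):
    """Pack 2-bit values four per byte, lowest bits first."""
    out = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for j, v in enumerate(values[i:i + 4]):
            byte |= (v & 0b11) << (2 * j)
        out.append(byte)
    return bytes(out)

# Two tokens, four channels; channels 1 and 3 are boosted to 4 bits.
boosted = [1, 3]
codes = [[2, 13, 0, 7],
         [3, 4, 1, 15]]

dense, sparse = split_page(codes, boosted)
restored = merge_page(dense, sparse, boosted)
packed = pack2([v for row in dense for v in row])
```

Under this assumed encoding both tensors use the same 2-bit packing, so a single dequantization path can serve every channel, and the boosted channels add only a small second tensor.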

Paper Structure

This paper contains 27 sections, 2 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of our KV cache quantization scheme, Kitty. The figure shows the decoding phase, where only the current query vector (a single token) participates in computation, while all previously stored key and value vectors in the KV cache are reused. The key cache is organized into three parts: (i) a Sink (initial tokens kept in FP16), (ii) a Q-Buffer (quantization buffer, temporarily storing the FP16 KVs before they form a quantization group), and (iii) quantized channels (most channels are quantized to INT2 for maximum compression, while a small fraction is preserved in INT4 to maintain accuracy). The value cache is quantized per-token with a sliding window, where both the Sink (initial tokens) and the most recent Local window (local tokens) are retained in FP16. Here, $L$ denotes the sequence length and $D$ the head size; $S$ is the number of preserved sink tokens, $G$ the quantization group size, and $R$ the local window size. The default configuration is $S=32$, $R=128$, $G=128$, which provides a good balance between accuracy and memory savings.
  • Figure 2: Visual and statistical analysis of the Key cache from Layer 10 of Qwen3-8B. (a) Visualization of Key-cache magnitude from the first KV head. The uneven distribution, with a few channels showing consistently high magnitudes, motivates a channel-aware approach to quantization. The vertical axis denotes activation magnitudes, while the horizontal axis spans the token and channel dimensions. (b) The mean squared error (MSE) between the original attention score matrix and its perturbed counterpart after quantizing each channel of the Key cache. The pattern is consistent across the different Q heads that share the same Key cache due to grouped-query attention (GQA). Similar patterns are observed on other layers and models.
  • Figure 3: Illustration of Kitty’s page-centric KV cache layout.
  • Figure 4: Accuracy recovery with channel-wise precision boost on Qwen3-8B. Similar trends are observed on other tasks.
  • Figure 5: Memory usage and throughput comparison on Qwen3-8B generating 8192 tokens. Kitty can achieve higher throughput by enabling larger batch sizes.
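The key-cache layout described in the Figure 1 caption lends itself to a rough storage estimate. The sketch below is back-of-the-envelope arithmetic rather than a reproduction of the paper's measurements. It counts FP16 sink and Q-Buffer tokens plus INT2/INT4 quantized channels for a single head. It assumes a hypothetical 12.5% boost fraction and ignores scale and zero-point metadata as well as the value cache.

```python
def key_cache_bits(seq_len, head_dim, sink=32, group=128, boost_fraction=0.125):
    """Approximate key-cache size in bits for one head (metadata ignored)."""
    sink_tokens = min(sink, seq_len)
    remaining = seq_len - sink_tokens
    # Only complete groups of `group` tokens are quantized; the rest wait in FP16.
    quantized_tokens = (remaining // group) * group
    buffer_tokens = remaining - quantized_tokens
    fp16_bits = (sink_tokens + buffer_tokens) * head_dim * 16
    boosted_channels = int(head_dim * boost_fraction)  # assumed boost fraction
    bits_per_token = boosted_channels * 4 + (head_dim - boosted_channels) * 2
    return fp16_bits + quantized_tokens * bits_per_token

def compression_ratio(seq_len, head_dim, **kwargs):
    """FP16 key-cache size divided by the mixed-precision size."""
    return (seq_len * head_dim * 16) / key_cache_bits(seq_len, head_dim, **kwargs)

ratio_default = compression_ratio(8192, 128)
ratio_no_boost = compression_ratio(8192, 128, boost_fraction=0.0)
ratio_long = compression_ratio(1_000_000, 128, boost_fraction=0.0)
print(ratio_default, ratio_no_boost, ratio_long)
```

In this simplified model the ratio stays below 8 because the sink and Q-Buffer tokens remain in FP16. It moves closer to 8 as the sequence grows and as the boost fraction shrinks.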