VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang

TL;DR

VecInfer reduces the KV cache memory burden in LLM inference by suppressing key-cache outliers with a dual transformation (smooth plus Hadamard) and then applying vector quantization with pre-trained codebooks. A fused CUDA kernel dequantizes and computes attention in one pass, which minimizes memory traffic. The main contributions are the dual transformation, which reduces quantization difficulty; task-independent codebooks; and the fused kernel design. At $2$-bit quantization, VecInfer reaches accuracy comparable to full precision on long-context understanding and mathematical reasoning tasks, with up to $2.7\times$ speedup in large-batch self-attention computation and $8.3\times$ lower single-batch end-to-end latency on Llama-3.1-8B at a 196k sequence length.
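
The pipeline summarized above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-channel smoothing factor (absolute maximum), the sub-vector length of 4, the 256-entry codebook, and the use of randomly sampled sub-vectors in place of a pre-trained codebook are all hypothetical choices made for the example. The query is scaled by the same factors and rotated by the same orthonormal Hadamard matrix, so attention scores are unchanged before quantization.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def vq_encode(X: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Split each row into sub-vectors and keep the nearest centroid index."""
    sub = X.reshape(X.shape[0], -1, codebook.shape[1])  # (tokens, groups, sub_dim)
    dists = ((sub[:, :, None, :] - codebook[None, None]) ** 2).sum(-1)
    return dists.argmin(-1).astype(np.uint8)            # assumes <= 256 centroids

def vq_decode(idx: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up centroids and stitch the sub-vectors back into rows."""
    return codebook[idx].reshape(idx.shape[0], -1)

rng = np.random.default_rng(0)
K = rng.standard_normal((256, 64))   # toy key cache: 256 tokens, head_dim 64
K[:, 3] *= 50.0                      # inject one outlier channel
q = rng.standard_normal(64)          # one query vector

smooth = np.abs(K).max(axis=0)       # assumed smoothing factor: per-channel absmax
H = hadamard(K.shape[1])
K_t = (K / smooth) @ H               # smooth, then rotate the keys
q_t = (q * smooth) @ H               # inverse scaling on the query keeps q . k fixed

# Stand-in for a pre-trained codebook: 256 sampled sub-vectors of length 4,
# i.e. one 8-bit index per 4 elements = 2 bits per element.
subs = K_t.reshape(-1, 4)
codebook = subs[rng.choice(len(subs), size=256, replace=False)]

idx = vq_encode(K_t, codebook)
K_hat = vq_decode(idx, codebook)

def spread(X: np.ndarray) -> float:
    """Ratio of the largest to the smallest per-channel absolute maximum."""
    m = np.abs(X).max(axis=0)
    return float(m.max() / m.min())

rel_err = np.linalg.norm(K_hat - K_t) / np.linalg.norm(K_t)
print("channel spread before/after:", spread(K), spread(K_t))
print("relative reconstruction error:", rel_err)
```

In a real deployment the codebook would be trained offline (for example with k-means) rather than sampled, and decoding would be fused into the attention kernel instead of materializing `K_hat`.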

Abstract

The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $\mathbf{2.7\times}$ speedup in large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.
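
To give a sense of scale for the 196k-token setting mentioned above, the following back-of-the-envelope sketch estimates KV cache size at 16-bit and 2-bit precision. It assumes the publicly documented Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128), reads "196k" as 196,000 tokens, and ignores codebook storage and any other metadata, so the 2-bit figure is a lower bound rather than a number reported by the paper. The helper `kv_cache_bytes` is a hypothetical name introduced for this example.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_elem: int, batch: int = 1) -> float:
    """Bytes needed to store keys and values for every layer and KV head."""
    elems = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return elems * bits_per_elem / 8

# Assumed model shape (Llama-3.1-8B public config) and sequence length.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
seq_len = 196_000

fp16_bytes = kv_cache_bytes(seq_len, bits_per_elem=16, **cfg)
two_bit_bytes = kv_cache_bytes(seq_len, bits_per_elem=2, **cfg)

print(f"FP16 : {fp16_bytes / 2**30:.1f} GiB")
print(f"2-bit: {two_bit_bytes / 2**30:.1f} GiB (indices only)")
```

Under these assumptions the cache shrinks by a factor of 8, from roughly 24 GiB to roughly 3 GiB for a single sequence, which is why less data has to be read from GPU memory at each decoding step.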

Paper Structure

This paper contains 42 sections, 20 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Key cache distribution and codebook representation for Llama-3.1-8B-Instruct at layer 16. (a) Dual transformation reduces channel-wise variation and suppresses outliers, resulting in a more uniform distribution. (b) This uniformity facilitates task-independent codebook representations and ensures comprehensive coverage of the original data distribution.
  • Figure 2: Typical vector quantization pipeline.
  • Figure 3: Transformation from $\mathbf{V}^\top$ to $\mathbf{A}$ via SVD.
  • Figure 4: Overview of VecInfer. During inference, dual transformation is applied before vector quantization.
  • Figure 5: Left: Attention kernel speed comparison between VecInfer and the non-fused baseline on H100. Right: Workflow of the VecInfer kernel with fine-grained tiled computation and asynchronous pipeline execution.
  • ...and 6 more figures

Theorems & Definitions (1)

  • proof