MILLION: Mastering Long-Context LLM Inference Via Outlier-Immunized KV Product Quantization

Zongwu Wang, Peng Xu, Fangxin Liu, Yiwei Hu, Qingxiao Sun, Gezi Li, Cheng Li, Xuan Wang, Li Jiang, Haibing Guan

TL;DR

MILLION tackles the memory and latency bottlenecks of long-context LLM inference by compressing the KV cache with a non-uniform product quantization (PQ) scheme that is robust to outliers. The paper first analyzes KV cache distributions to motivate non-uniform PQ, then pairs offline codebook training with an online prefill/decode flow that quantizes asynchronously, so that on-the-fly quantization and de-quantization do not become a bottleneck. A GPU inference framework with a dedicated attention kernel and sparse computation delivers a 2.09x end-to-end speedup at 32K context length with 4-bit quantization, at trivial perplexity and accuracy loss. The code is publicly released for reproducibility.

Abstract

Large language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens. This trend, however, presents significant challenges in inference speed and memory management. Quantization emerges as a promising approach to address the widening gap between LLM size and memory capacity. However, traditional quantization schemes often yield suboptimal compression results for KV caches due to two key factors: i) on-the-fly quantization and de-quantization, which cause significant performance overhead; ii) the prevalence of outliers in KV values, which challenges low-bitwidth uniform quantization. To this end, we propose MILLION, a novel quantization framework that achieves a low-bitwidth KV cache through product quantization. First, we conduct a thorough analysis of the KV cache distribution, revealing the limitations of existing quantization schemes. Second, we introduce a non-uniform quantization algorithm based on product quantization, which efficiently compresses data while preserving accuracy. Third, we develop a high-performance GPU inference framework for MILLION, with an efficient attention kernel and pipeline design that leverage sparse computation and asynchronous quantization, significantly enhancing inference speed. Comprehensive evaluation results demonstrate that MILLION can achieve 4-bit quantization with trivial perplexity and accuracy loss, and a 2.09x end-to-end performance gain at 32K context length. Code is released at https://github.com/ZongwuWang/MILLION.
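
As a rough illustration of the product quantization idea named in the abstract, the sketch below splits each vector into subvectors, learns a small k-means codebook per subspace, and stores only the centroid indices. It is a minimal standard-library sketch, not the released MILLION implementation: the helper names (`kmeans`, `train_codebooks`, `encode`, `decode`), the toy data, and the choice of 4 subspaces with 16 centroids each are all assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: sq_dist(vec, centroids[j]))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def split(vec, m):
    """Split vec into m contiguous subvectors of equal length."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]

def train_codebooks(vectors, m, k):
    """One k-means codebook per subspace (done ahead of time)."""
    return [kmeans([split(v, m)[s] for v in vectors], k, seed=s) for s in range(m)]

def encode(vec, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    return [nearest(sub, cb) for sub, cb in zip(split(vec, len(codebooks)), codebooks)]

def decode(codes, codebooks):
    """Concatenate the centroids selected by the codes."""
    return [x for c, cb in zip(codes, codebooks) for x in cb[c]]

rng = random.Random(0)
# Toy "key cache": 256 vectors of dimension 8; channel 0 carries large outliers.
keys = [[rng.gauss(0, 1) * (20 if ch == 0 else 1) for ch in range(8)]
        for _ in range(256)]

# 4 subspaces of 2 elements each, 16 centroids per subspace -> one 4-bit index
# per subvector.
codebooks = train_codebooks(keys, m=4, k=16)
codes = [encode(v, codebooks) for v in keys]
recon = [decode(c, codebooks) for c in codes]

mse = sum(sq_dist(a, b) for a, b in zip(keys, recon)) / (len(keys) * 8)
print("codes for first key:", codes[0])
print("mean squared reconstruction error:", mse)
```

Because the centroids are learned from the data, the codebook for the subspace containing the outlier channel can place its levels wherever the values actually fall. A uniform grid cannot do this, since its step size is dictated by the largest magnitude.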

Paper Structure

This paper contains 15 sections, 7 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of transformer model structure. (a) Transformer model inference consists of two stages: the prefill stage processes all prompt tokens in a batch, and the decode stage generates tokens one by one; (b) Details of the attention block.
  • Figure 2: Magnitude distribution of key and value cache for Llama-2-13B and Falcon-7B.
  • Figure 3: Channel-wise standard deviation distribution of key and value cache for Llama-2-13B and Falcon-7B.
  • Figure 4: An overview of the MILLION algorithm framework, which consists of three parts: 1) offline PQ centroid training; 2) prefill stage with KV quantization; 3) decode stage with KV quantization.
  • Figure 5: Detailed model information for evaluation.
  • ...and 4 more figures
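
The three-part flow in the Figure 4 caption (offline centroid training, prefill quantization, decode quantization) can be sketched as below. This is a hedged toy example and not the paper's kernel: the hard-coded `CODEBOOKS` stand in for centroids that would be trained offline, and the lookup-table scoring in `scores_from_codes` is an assumed way of computing query-key dot products directly on the codes without de-quantizing the cache. All names and values are hypothetical, and the usual scaling and softmax steps of attention are omitted.

```python
def split(vec, m):
    """Split vec into m contiguous subvectors of equal length."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(sub, centroids):
    """Index of the centroid closest to sub (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((x - y) ** 2 for x, y in zip(sub, centroids[j])))

# 1) Offline: centroids would be trained ahead of time. These toy values are
#    hard-coded assumptions: 2 subspaces, 4 centroids each, 2 dims per subspace.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [8.0, 0.0]],   # last entry covers an outlier
    [[0.0, 0.0], [1.0, -1.0], [-1.0, -1.0], [0.0, 2.0]],
]

def quantize(key):
    """Store a key as one centroid index per subspace."""
    return [nearest(sub, cb)
            for sub, cb in zip(split(key, len(CODEBOOKS)), CODEBOOKS)]

def score_lut(query):
    """Per-subspace table of dot products between the query and each centroid."""
    return [[dot(q_sub, c) for c in cb]
            for q_sub, cb in zip(split(query, len(CODEBOOKS)), CODEBOOKS)]

def scores_from_codes(query, cache_codes):
    """Query-key scores via table lookups, with no de-quantization of the cache."""
    lut = score_lut(query)
    return [sum(lut[s][c] for s, c in enumerate(codes)) for codes in cache_codes]

# 2) Prefill: quantize every prompt key at once.
prompt_keys = [[0.9, 1.1, 0.1, -0.2], [7.5, 0.3, 1.2, -0.8]]
cache = [quantize(k) for k in prompt_keys]

# 3) Decode: quantize the newly generated token's key and append it.
cache.append(quantize([-1.2, 0.8, 0.1, 1.9]))

query = [0.5, -1.0, 2.0, 0.25]
scores = scores_from_codes(query, cache)
print(cache)   # [[1, 0], [3, 1], [2, 3]]
print(scores)  # [-0.5, 5.75, -1.0]
```

The lookup table is built once per query and reused for every cached token, so the per-token cost is a handful of table reads rather than a full reconstruction of the key. The result is identical to de-quantizing each key and taking the dot product, because the dot product decomposes over subspaces.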