PQCache: Product Quantization-based KVCache for Long Context LLM Inference

Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui

TL;DR

PQCache addresses the KVCache memory bottleneck in long-context LLM inference by treating KVCache management as an embedding retrieval problem and applying Product Quantization (PQ) to compress tokens' keys and fetch the relevant key-value pairs. It reports a 4.60% score improvement over existing methods on InfiniteBench while keeping latency low by overlapping the added computation and communication and by using a GPU cache. It is also reported to be robust across models, tasks, and PQ configurations, compatible with prefilling acceleration methods, and scalable to larger models. The result is a retrieval-centric approach to long-context generation that combines PQ-based retrieval with system-algorithm co-design.

Abstract

As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), the intermediate representations of tokens within LLM inference, has now become the primary memory bottleneck due to limited GPU memory. Current methods selectively determine suitable keys and values for self-attention computation in LLMs to address the issue. However, they either fall short in maintaining model quality or result in high serving latency. Drawing inspiration from advanced embedding retrieval techniques prevalent in the data management community, we consider the storage and retrieval of KVCache as a typical embedding retrieval problem. We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency. During the prefilling phase, we apply PQ to tokens' keys for each LLM layer and head. During the autoregressive decoding phase, we use PQ codes and centroids to approximately identify important preceding tokens, then fetch the corresponding key-value pairs for self-attention computation. Through meticulous design of overlapping and caching, we minimize any additional computation and communication overhead during both phases. Extensive experiments demonstrate that PQCache achieves both effectiveness and efficiency, with 4.60% score improvement over existing methods on InfiniteBench and low system latency in both prefilling and decoding.
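
The two phases described in the abstract can be illustrated with a minimal NumPy sketch for a single layer and head. This is an illustration under stated assumptions, not the authors' implementation: the function names (`kmeans`, `pq_build`, `pq_topk`), the plain Lloyd-style k-means, the inner-product scoring, and all sizes are hypothetical choices made for the example. Keys are split into `m` sub-vectors, each sub-space is clustered into `k` centroids, and at decoding time a query is scored against centroids only, with per-token scores recovered by table lookup on the PQ codes.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means (an assumed, simplified clustering routine)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)

def pq_build(keys, m, k):
    """Prefilling side: split keys into m sub-vectors and cluster each sub-space."""
    n, d = keys.shape
    assert d % m == 0, "head dimension must be divisible by m"
    sub = keys.reshape(n, m, d // m)
    centroids = np.empty((m, k, d // m), dtype=keys.dtype)
    codes = np.empty((n, m), dtype=np.int64)
    for i in range(m):
        centroids[i], codes[:, i] = kmeans(sub[:, i, :], k, seed=i)
    return centroids, codes

def pq_topk(query, centroids, codes, topk):
    """Decoding side: approximate q.k for every token from codes and centroids."""
    m, k, ds = centroids.shape
    q_sub = query.reshape(m, ds)
    # table[i, j] = inner product of the i-th query sub-vector with centroid j.
    table = np.einsum("md,mkd->mk", q_sub, centroids)
    # Sum the looked-up partial scores over sub-spaces: shape (n,).
    approx = table[np.arange(m)[None, :], codes].sum(1)
    idx = np.argpartition(-approx, topk - 1)[:topk]
    return idx, approx

# Hypothetical sizes: 512 cached tokens, head dimension 64, m=4 sub-spaces, k=16 centroids.
rng = np.random.default_rng(0)
keys = rng.standard_normal((512, 64))
query = rng.standard_normal(64)
centroids, codes = pq_build(keys, m=4, k=16)
important, approx_scores = pq_topk(query, centroids, codes, topk=32)
# Only the key-value pairs at `important` would then be fetched for exact attention.
```

In this sketch the per-step scoring cost depends on the number of centroids and the code table rather than on the full keys, which is the property that makes it plausible to keep only codes and centroids close to the compute while the full key-value pairs live elsewhere.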

Paper Structure (31 sections, 3 equations, 12 figures, 6 tables, 2 algorithms)

Figures (12)

  • Figure 1: KVCache memory size and theoretical CPU-GPU transfer latency over PCI-e Gen 5 for varying batch sizes (bs), model sizes (7B and 13B), and sequence lengths.
  • Figure 2: Comparison between information retrieval and LLM inference with selective attention.
  • Figure 3: An overview of LLM inference. The left part illustrates the computation process of the self-attention module, where "Q", "K", "V", "AS", and "O" represent query, key, value, attention score, and output, respectively. The right part depicts the LLM inference process, consisting of the prefilling phase and the decoding phase, where "Attn" and "FFN" represent the attention layer and the feed-forward network layer, respectively. The mathematical symbols are detailed in Table \ref{tab:notations}.
  • Figure 4: An overview of PQ construction and searching.
  • Figure 5: An overview of PQCache. For simplicity, we only illustrate the process for a single transformer layer.
  • ...and 7 more figures
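
The quantities named in the Figure 1 caption (KVCache memory size and theoretical CPU-GPU transfer latency) can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation under assumptions that are not taken from the paper: a 7B-class model shape of 32 layers, 32 key-value heads, and head dimension 128, 16-bit storage, and a nominal 64 GB/s for a PCI-e Gen 5 x16 link. The helper names `kvcache_bytes` and `transfer_seconds` are hypothetical.

```python
def kvcache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                  bytes_per_elem=2):
    """KVCache size in bytes; the leading factor 2 counts keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time: size divided by link bandwidth, no overheads."""
    return num_bytes / bandwidth_bytes_per_s

# Assumed 7B-class shape (not from the paper): 32 layers, 32 KV heads, head_dim 128, fp16.
size = kvcache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                     seq_len=128 * 1024, batch_size=1)

# Assumed nominal PCI-e Gen 5 x16 bandwidth of 64 GB/s.
latency = transfer_seconds(size, 64e9)

print(f"KVCache: {size / 2**30:.0f} GiB, transfer: {latency:.2f} s")
```

Under these assumptions each token occupies 0.5 MiB of KVCache, so a 128K-token sequence at batch size 1 needs 64 GiB, and moving all of it across the link would take roughly one second even at the nominal rate.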