CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang

TL;DR

CacheGen tackles the end-to-end latency of serving LLMs with long contexts by encoding KV caches into compact bitstreams and streaming them adaptively over bandwidth-varying networks. Its design leverages token-wise locality, layer-sensitive quantization, and arithmetic coding to drastically shrink KV-cache transmission size, while a bandwidth-aware streaming controller maintains TTFT within SLOs. Empirical results across multiple models and datasets show 3.1-4.7× TTFT reductions and 3.5-4.3× KV-cache size reductions, with negligible degradation in generation quality. The work demonstrates a practical, scalable approach to fast large-language-model serving in distributed environments, and complements existing context-compression and retrieval strategies.
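
The effect of token-wise locality and layer-sensitive quantization on entropy-coded size can be illustrated with a small sketch. It is not the paper's encoder: the random-walk channel, the consecutive-token delta scheme, and the `LAYER_STEP` bin widths are assumptions made for illustration, and empirical entropy stands in for the size an arithmetic coder would approach.

```python
import math
import random
from collections import Counter


def quantize(values, step):
    """Uniform quantization: map each float to an integer bin index."""
    return [round(v / step) for v in values]


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol, a proxy for the
    per-symbol cost of an ideal entropy coder such as arithmetic coding."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def consecutive_deltas(values):
    """Keep the first value, then store each value's difference from the
    previous token (an illustrative way to exploit token-wise locality)."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def synthetic_channel(num_tokens, seed):
    """Hypothetical KV channel: values drift slowly from token to token."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(num_tokens):
        x += rng.gauss(0.0, 0.05)
        out.append(x)
    return out


# Hypothetical per-layer bin widths: finer bins for layers assumed to be
# more sensitive to loss, coarser bins for the rest.
LAYER_STEP = {0: 0.05, 1: 0.1, 2: 0.2}

channel = synthetic_channel(2000, seed=0)
raw_bits = entropy_bits(quantize(channel, 0.1))
delta_bits = entropy_bits(quantize(consecutive_deltas(channel), 0.1))
per_layer_bits = {
    layer: entropy_bits(quantize(consecutive_deltas(channel), step))
    for layer, step in LAYER_STEP.items()
}

print(f"raw values  : {raw_bits:.2f} bits/element")
print(f"delta values: {delta_bits:.2f} bits/element")
for layer, bits in per_layer_bits.items():
    print(f"layer {layer} (step {LAYER_STEP[layer]}): {bits:.2f} bits/element")
```

On this synthetic data the quantized deltas concentrate in a few bins around zero, so they need fewer bits per element than the quantized raw values, and coarser bins reduce the cost further at the price of more quantization error.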

Abstract

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. When available bandwidth drops, CacheGen may raise the compression level for a part of the context or recompute its KV cache on the fly. We test CacheGen on popular LLMs and datasets. Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3× and the total delay in fetching and processing contexts by 3.2-3.7× with negligible impact on the LLM response quality. Our code is at: https://github.com/UChi-JCL/CacheGen.
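
The bandwidth adaptation described in the abstract can be sketched as a per-chunk choice among encoding levels, with a fallback to sending text and recomputing the KV cache. This is a minimal sketch under stated assumptions, not CacheGen's controller: the `Option` fields, the sizes, compute times, and quality scores, and the rule of splitting the remaining delay budget evenly across remaining chunks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    size_bytes: int   # bytes sent over the network for one chunk
    compute_s: float  # GPU time to decode the bitstream or recompute the KV
    quality: float    # hypothetical relative quality, 1.0 = no loss


def chunk_delay(opt, bandwidth_bps):
    """Network transfer time plus compute time for one chunk."""
    return opt.size_bytes * 8 / bandwidth_bps + opt.compute_s


def pick_option(options, bandwidth_bps, budget_s):
    """Best-quality option that fits the budget (ties go to the faster one);
    if nothing fits, fall back to the fastest option."""
    feasible = [o for o in options if chunk_delay(o, bandwidth_bps) <= budget_s]
    if feasible:
        return max(feasible, key=lambda o: (o.quality, -chunk_delay(o, bandwidth_bps)))
    return min(options, key=lambda o: chunk_delay(o, bandwidth_bps))


def stream(options, bandwidth_trace_bps, slo_s):
    """Choose an option per chunk, splitting the remaining time evenly
    over the remaining chunks (an assumed, simplistic budgeting rule)."""
    elapsed, plan = 0.0, []
    n = len(bandwidth_trace_bps)
    for i, bw in enumerate(bandwidth_trace_bps):
        budget = (slo_s - elapsed) / (n - i)
        opt = pick_option(options, bw, budget)
        elapsed += chunk_delay(opt, bw)
        plan.append(opt.name)
    return plan, elapsed


# Hypothetical per-chunk options; none of these numbers come from the paper.
OPTIONS = [
    Option("kv-fine", 40_000_000, 0.02, 1.00),
    Option("kv-coarse", 10_000_000, 0.02, 0.97),
    Option("text-recompute", 20_000, 0.60, 1.00),
]

# One bandwidth sample (bits per second) per chunk.
trace = [2e9, 0.5e9, 0.05e9, 2e9]
plan, total = stream(OPTIONS, trace, slo_s=1.6)
print(plan, f"{total:.4f}s")
```

With this trace, the sketch sends the fine encoding while bandwidth is high, switches to the coarse encoding when bandwidth drops, and falls back to text plus recomputation when the link becomes too slow for any KV bitstream to fit the budget.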

Paper Structure

This paper contains 30 sections, 19 figures, 2 tables, 1 algorithm.

Figures (19)

  • Figure 1: When the context is reused, CacheGen speeds up the sharing of its KV cache by compressing (encoding) the KV cache.
  • Figure 2: How different ways of loading context affect the network delay (to transfer context or KV cache) and the computation delay (to run the attention module on the context).
  • Figure 3: Contrasting the distribution of the original values and the delta values. We model two Llama models with various long contexts (see the insights subsection). We show absolute values for clarity.
  • Figure 4: Applying data loss to different layers of a KV cache has a different impact on accuracy (same workload as Figure 3).
  • Figure 5: Entropy (bits per element) when using different grouping strategies (same workload as Figure 3).
  • ...and 14 more figures