Compute Or Load KV Cache? Why Not Both?

Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z. Morley Mao

TL;DR

Cake addresses the latency bottleneck of KV cache prefill in long-context LLM inference by balancing computation and I/O through a bidirectional, chunk-based loading strategy. The approach jointly leverages compute from the sequence head and I/O from the tail, with adaptive scheduling to accommodate non-prefix requests and fluctuating resources. Extensive experiments across hardware, models, contexts, and compression schemes show consistent TTFT reductions (around 2–3x on average) and notable throughput gains, with minimal overhead. This work provides a practical, scalable solution for optimizing inference in large-scale AI deployments by bridging compute and storage bottlenecks.

Abstract

Large Language Models (LLMs) are increasingly deployed in large-scale online services, enabling sophisticated applications. However, the computational overhead of generating key-value (KV) caches in the prefill stage presents a major bottleneck, particularly for long-context inputs. Prefix caching mitigates this issue by storing KV caches for reuse, reducing redundant computation. Despite its advantages, prefix caching suffers from high latency due to the limited I/O bandwidth of storage devices, constraining inference efficiency. To address this challenge, we introduce Cake, a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel. Cake employs a bidirectional scheduling strategy that dynamically balances KV cache computation and loading, ensuring efficient resource utilization. Additionally, Cake incorporates an adaptive scheduling mechanism that seamlessly integrates with non-prefix caching requests, improving system throughput and adapting to fluctuating resource availability. In extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves an average 2.6x reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. Our findings highlight Cake as an effective and practical solution for optimizing long-context LLM inference, bridging the gap between computation and I/O efficiency in large-scale AI deployments.
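
To make the bidirectional idea concrete, the following is a minimal simulation sketch, not the paper's algorithm. It assumes the sequence is split into chunks, that one compute worker consumes chunks from the head while one I/O worker consumes chunks from the tail, and that whichever worker is free first claims its next chunk. The function name `bidirectional_schedule` and the per-chunk timings are hypothetical and chosen only for illustration.

```python
def bidirectional_schedule(compute_times, load_times):
    """Simulate compute-from-head and load-from-tail until the two meet.

    compute_times[i]: assumed seconds to compute the KV cache of chunk i.
    load_times[i]:    assumed seconds to load the KV cache of chunk i.
    Returns (finish_time, chunks_computed, chunks_loaded).
    """
    n = len(compute_times)
    head, tail = 0, n - 1        # next chunk for compute / next chunk for I/O
    t_compute = t_load = 0.0     # time at which each worker becomes free
    while head <= tail:
        # The worker that becomes free first claims its next unclaimed chunk.
        if t_compute <= t_load:
            t_compute += compute_times[head]
            head += 1
        else:
            t_load += load_times[tail]
            tail -= 1
    return max(t_compute, t_load), head, n - 1 - tail


# Hypothetical timings: 8 chunks, compute 1 s per chunk, load 3 s per chunk.
result = bidirectional_schedule([1.0] * 8, [3.0] * 8)
print(result)
```

With these made-up timings, compute-only would take 8 s and load-only 24 s, while the combined schedule finishes once the two pointers meet. Because the split point is decided chunk by chunk, the same loop would shift work toward whichever resource happens to be faster at the moment.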

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Workflow of long-context LLM inference with prefix caching. Cake operates during the KV cache loading phase (highlighted in blue). The configuration parameters are based on the specifications of a LambdaLab GPU server.
  • Figure 2: Workflow of Cake: Computation starts from the beginning of the sequence, while I/O loading starts from the end. Both processes progress in parallel and merge in the middle, ensuring efficient KV cache loading and minimal latency.
  • Figure 3: Comparison of equivalent KV cache loading bandwidth (bytes/second) across different storage mediums and GPU computation. (Bandwidth for GPU computation is calculated by dividing the total KV cache size by processing time.)
  • Figure 4: Chunk prefill time per step vs. chunk index for different methods.
  • Figure 5: Cake trace under fluctuating network bandwidth and available compute power. Hardware: A100, Model: Long-Alpaca-7B, I/O Bandwidth: 0-25Gbps, Compute Utilization: 0-512 budget, Seq-len: 16k.
  • ...and 2 more figures
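
The "equivalent bandwidth" in the Figure 3 caption (total KV cache size divided by processing time) can be illustrated with a short calculation. This is a sketch under stated assumptions: the KV cache size uses the standard per-token formula for a decoder-only transformer (keys and values for every layer and KV head), the model shape is a hypothetical 7B-class configuration, and the 2-second prefill time is a made-up value rather than a measurement from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Total KV cache size: keys + values for every layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


def equivalent_bandwidth(total_bytes, seconds):
    """Bytes per second at which compute 'produces' the KV cache."""
    return total_bytes / seconds


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16, 16k tokens.
size = kv_cache_bytes(seq_len=16 * 1024, num_layers=32, num_kv_heads=32, head_dim=128)
prefill_seconds = 2.0  # hypothetical prefill time, for illustration only
bw = equivalent_bandwidth(size, prefill_seconds)
print(f"KV cache: {size / 2**30:.1f} GiB, equivalent bandwidth: {bw / 2**30:.1f} GiB/s")
```

Putting GPU computation and storage reads in the same unit (bytes per second) is what lets the two be compared directly when deciding how much of the cache to compute and how much to load.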