DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving

Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic

TL;DR

DéjàVu addresses inefficiencies in distributed, stateful LLM serving with KV-cache streaming. It disaggregates prompt processing from token generation to reduce pipeline bubbles, swaps microbatches to manage GPU memory, and replicates the KV cache for fault tolerance. All three techniques are built on DéjàVuLib, a modular KV-cache streaming library. Reported results include up to 2x higher throughput than FasterTransformer, up to 1.8x gains from microbatch swapping, and lower latency and faster recovery when failures occur. The evaluation covers a range of large models across cloud deployments.

Abstract

Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. In this paper, we propose DéjàVu, a system to address all these challenges using a versatile and efficient KV-cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.
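
To illustrate why the bimodal latency of prompt and token processing creates pipeline bubbles, here is a minimal toy simulation. It is a sketch under stated assumptions, not DéjàVu's or FasterTransformer's scheduler: the function name `simulate_pipeline` is hypothetical, every microbatch is assumed to run its prompt in round 0 and one token per later round, the stage order is fixed, and a prompt step is assumed to cost 2x a token step purely for illustration.

```python
def simulate_pipeline(durations, num_stages):
    """Toy fixed-order pipeline schedule (illustrative assumption, not the paper's scheduler).

    durations[r][m] is the time microbatch m spends on EACH stage in round r
    (round 0 = prompt processing, later rounds = one generated token each).
    Returns (makespan, idle), where idle[s] is the total gap between
    consecutive tasks on stage s; the initial warm-up is not counted.
    """
    num_microbatches = len(durations[0])
    stage_free = [0] * num_stages      # time each stage finishes its latest task
    started = [False] * num_stages     # whether a stage has run any task yet
    idle = [0] * num_stages
    done = [0] * num_microbatches      # time each microbatch last left the final stage

    for round_durations in durations:
        for m, d in enumerate(round_durations):
            # The next step of a microbatch needs its previous step to have
            # left the last stage (autoregressive dependency).
            ready = done[m]
            for s in range(num_stages):
                start = max(ready, stage_free[s])
                if started[s]:
                    idle[s] += start - stage_free[s]  # bubble on stage s
                started[s] = True
                stage_free[s] = start + d
                ready = stage_free[s]
            done[m] = ready
    return max(stage_free), idle


# 4 stages, 4 microbatches, 3 rounds.
uniform = [[1] * 4 for _ in range(3)]               # every step costs 1 unit
bimodal = [[2] * 4] + [[1] * 4 for _ in range(2)]   # prompt costs 2x a token step

uniform_makespan, uniform_idle = simulate_pipeline(uniform, num_stages=4)
bimodal_makespan, bimodal_idle = simulate_pipeline(bimodal, num_stages=4)
print("uniform idle per stage:", uniform_idle)
print("bimodal idle per stage:", bimodal_idle)
```

In this toy setup the uniform schedule keeps every stage busy once the pipeline is full, while the bimodal schedule leaves gaps on the earlier stages. Separating prompt work from token work, as prompt-token disaggregation does, is aimed at removing that mismatch.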

Paper Structure

This paper contains 32 sections, 26 equations, 31 figures, and 5 tables.

Figures (31)

  • Figure 1: Memory footprint of serving various LLMs with 2K sequence length (input + generated tokens) and half precision (fp16).
  • Figure 2: Prompt processing and average per-token generation time on A100 GPUs, using FasterTransformer (with batch size 8 and prompt size 1000). Y-axis is in log scale.
  • Figure 3: LLM serving with a 4-stage pipeline. A stage is a machine with $n$ GPUs running a set of layers with tensor model parallelism. $Px$ denotes prompt processing of microbatch $x$; $X_y$ denotes generation of token $y$ for microbatch $X$. For simplicity, this figure assumes that prompt processing takes $2\times$ the per-token processing time. In reality, the prompt-token difference can be up to $106\times$ (see the LLM profiling appendix). Grey areas are bubbles caused by the latency discrepancy between prompt processing and token generation.
  • Figure 4: Effect on cumulative latency of an inference request when a failure occurs in today's systems, on a GPT2-1.5B model.
  • Figure 5: Full DéjàVu system diagram. When disaggregation is enabled, each worker performs either prompt processing only (P-worker) or token generation only (T-worker). The blue arrows stand for prompt-token cache exchange, the red arrows for cache replication, and the orange arrows for cache swapping.
  • ...and 26 more figures
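
The memory footprint that Figure 1 refers to can be approximated with a back-of-the-envelope estimate of KV-cache size. The sketch below assumes standard multi-head attention, where each layer stores one key and one value vector of the hidden size per token. The function name `kv_cache_bytes` is hypothetical, and the layer count and hidden size (96 and 12288, a GPT-3-scale configuration) are illustrative values, not figures taken from this summary. The 2K sequence length and fp16 precision follow the Figure 1 caption.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    Assumes standard multi-head attention (no grouped-query or multi-query
    sharing), so each token stores hidden_size keys and hidden_size values
    in every layer. bytes_per_elem=2 corresponds to fp16.
    """
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_elem


GIB = 2**30

# Illustrative GPT-3-scale configuration (assumed values).
per_seq = kv_cache_bytes(num_layers=96, hidden_size=12288, seq_len=2048, batch_size=1)
batch8_gib = kv_cache_bytes(96, 12288, 2048, batch_size=8) / GIB

print(f"KV cache per sequence: {per_seq / GIB:.1f} GiB")
print(f"KV cache for batch of 8: {batch8_gib:.1f} GiB")
```

Because the estimate grows linearly with both sequence length and batch size, reserving KV-cache space for the maximum sequence length of every microbatch quickly dominates GPU memory, which is the overprovisioning problem that microbatch swapping targets.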