DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
TL;DR
DéjàVu targets three inefficiencies of distributed, stateful LLM serving: pipeline bubbles, GPU memory overprovisioning, and slow failure recovery. It addresses them by streaming the KV cache, which allows prompt processing to be disaggregated from token generation and improves GPU utilization and memory management. Central to the system is DéjàVuLib, a modular KV-cache streaming library that supports prompt-token disaggregation, microbatch swapping, and KV-cache replication for fault tolerance. Empirical results show up to 2x higher throughput than FasterTransformer and up to 1.8x gains from microbatch swapping, as well as lower latency and faster recovery when failures occur. The evaluation covers a range of large models across cloud deployments.
Abstract
Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. In this paper, we propose DéjàVu, a system that addresses all of these challenges using a versatile and efficient KV-cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.
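
To make the idea of microbatch swapping concrete, the following is a minimal, illustrative sketch and not DéjàVuLib's actual API. It assumes a pipeline stage that works on one microbatch at a time, so only a fixed number of KV caches need to be GPU-resident while the rest wait in host memory. The class name `SwapManager`, its methods, and the oldest-first eviction rule are all hypothetical choices made for illustration; KV caches are represented by arbitrary Python objects rather than real tensors.

```python
from dataclasses import dataclass, field


@dataclass
class SwapManager:
    """Toy model of microbatch swapping (hypothetical, not DéjàVuLib's API).

    At most `gpu_slots` KV caches are treated as GPU-resident; all others
    are parked in host (CPU) memory until their microbatch is scheduled.
    """
    gpu_slots: int
    gpu: dict = field(default_factory=dict)   # microbatch id -> KV cache "on GPU"
    host: dict = field(default_factory=dict)  # microbatch id -> KV cache "on host"

    def admit(self, mb_id, kv_cache):
        # A new microbatch starts in host memory until it is scheduled.
        self.host[mb_id] = kv_cache

    def activate(self, mb_id):
        """Make `mb_id` GPU-resident, swapping out the oldest resident if full."""
        if mb_id in self.gpu:
            return self.gpu[mb_id]
        if len(self.gpu) >= self.gpu_slots:
            # dicts preserve insertion order, so the first key is the oldest.
            victim = next(iter(self.gpu))
            self.host[victim] = self.gpu.pop(victim)  # swap out
        self.gpu[mb_id] = self.host.pop(mb_id)        # swap in
        return self.gpu[mb_id]
```

With `gpu_slots=2` and four admitted microbatches, activating them in round-robin order keeps at most two caches in the `gpu` dictionary at any time, which is the memory saving that swapping is meant to provide. A real system would additionally overlap the transfers with computation, a detail this sketch leaves out.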
