FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling

Weiqing Li, Guochao Jiang, Xiangyong Ding, Zhangcheng Tao, Chuzhan Hao, Chenfeng Xu, Yuewei Zhang, Hao Wang

TL;DR

FlowKV addresses the KV-cache transfer bottleneck in disaggregated prefill/decode LLM inference by optimizing KV-cache structure, memory allocation, and transfer pipelines, coupled with a Load-Aware Scheduler that adapts to normal, imbalanced, and extreme loads. It reshapes KV caches from $(L,2,B,H)$ to $(B,L,2,H)$ to drastically reduce NCCL calls, uses segment-based memory management, and aligns block IDs to enable single-transfer operations. The framework demonstrates up to $96\%$ reduction in KV-cache transfer latency and substantial throughput improvements over existing open-source PD-disaggregated systems across homogeneous and heterogeneous GPU deployments, including real-world LongBench tasks with LLaMA-3 models. These results highlight FlowKV's practical impact for scalable, low-latency disaggregated inference in varied hardware environments.
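
To see why the layout change cuts the number of transfer calls, the following minimal sketch counts how many contiguous memory runs a request's blocks occupy under each layout. It is an illustration, not FlowKV's implementation. The symbols are not defined in this summary, so the sketch assumes $L$ is the number of layers, $2$ is the key/value pair, $B$ is the number of cache blocks, and $H$ is the number of elements per block. The function names and the layout tags `"LKBH"` and `"BLKH"` are hypothetical, and each contiguous run is assumed to cost one transfer call.

```python
from itertools import product


def flat_offset(index, shape):
    """Row-major offset of a multi-dimensional index."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset


def contiguous_runs(chunk_starts, chunk_len):
    """Count maximal contiguous runs formed by chunks of chunk_len elements."""
    starts = sorted(chunk_starts)
    if not starts:
        return 0
    runs = 1
    for prev, cur in zip(starts, starts[1:]):
        if cur != prev + chunk_len:
            runs += 1
    return runs


def transfers_needed(block_ids, num_layers, num_blocks, hidden, layout):
    """Assumed cost model: one transfer call per contiguous memory run."""
    chunk_starts = []
    for layer, kv, block in product(range(num_layers), range(2), block_ids):
        if layout == "LKBH":    # (L, 2, B, H): a block is scattered over layers
            start = flat_offset((layer, kv, block, 0),
                                (num_layers, 2, num_blocks, hidden))
        elif layout == "BLKH":  # (B, L, 2, H): a block is one contiguous slab
            start = flat_offset((block, layer, kv, 0),
                                (num_blocks, num_layers, 2, hidden))
        else:
            raise ValueError(f"unknown layout: {layout}")
        chunk_starts.append(start)
    return contiguous_runs(chunk_starts, hidden)


if __name__ == "__main__":
    for blocks in ([3, 4, 5], [3, 7, 9]):
        for layout in ("LKBH", "BLKH"):
            print(blocks, layout, transfers_needed(blocks, 4, 16, 8, layout))
```

With 4 layers and adjacent blocks `[3, 4, 5]`, this toy model yields 8 runs for $(L,2,B,H)$ (one per layer and key/value slice) and a single run for $(B,L,2,H)$. With scattered blocks `[3, 7, 9]` it yields 24 runs versus 3, which is why contiguous block allocation matters in addition to the reshape.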

Abstract

Disaggregated inference has become an essential framework that separates the prefill (P) and decode (D) stages in large language model inference to improve throughput. However, KV cache transfer between prefill and decode nodes incurs significant delays. The block-wise calling method and discontinuous KV cache memory allocation increase the number of calls to the transmission kernel. Additionally, existing frameworks often fix the roles of P and D nodes, leading to computational imbalances. In this paper, we propose FlowKV, a novel disaggregated inference framework that optimizes KV cache transfer, reducing its average transmission latency by 96%, from 0.944s to 0.053s, so that transfer time is almost eliminated relative to the total request latency. FlowKV introduces the Load-Aware Scheduler for balanced request scheduling and flexible PD node allocation. This design maximizes hardware resource utilization, achieving peak system throughput across various scenarios, including normal, computationally imbalanced, and extreme overload conditions. Experimental results demonstrate that FlowKV accelerates inference by 15.2%-48.9% on the LongBench dataset compared to the baseline and supports deployments with heterogeneous GPUs.

Paper Structure

This paper contains 18 sections, 3 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Time distribution of Prefill + Decode and KV Cache Transfer for a single request, using NCCL-based transfer built on the original PagedAttention [vllm]. The request is sampled from LongBench [longbench], with an input length of 13k and an output length of 100. In this case, the KV Cache Transfer time between the prefill node and the decode node in the disaggregated framework accounts for about a quarter of the entire inference latency.
  • Figure 2: FlowKV framework. FlowKV enables high-speed transmission of the KV Cache between P and D nodes through its KV Cache transfer module. It also employs a global controller to monitor the workload and KV cache status of P and D nodes in real time. Depending on the load condition (normal load, imbalanced load, extreme load), the system either applies load-balancing strategies for request scheduling or performs elastic node scaling to ensure efficient resource utilization.
  • Figure 3: Throughput performance using simulated data in a homogeneous deployment scenario.
  • Figure 4: E2E performance in a heterogeneous deployment scenario.
  • Figure 5: Comparison of the KV Cache transfer process in FlowKV with the pre-optimization approach. FlowKV tries, as far as possible, to allocate the KV cache block IDs of both the sending and receiving instances within a contiguous memory segment. Before the transfer, bidirectional segment alignment compares the block ID lists of the sending and receiving instances and identifies N contiguous block IDs that are present on both sides. Since these N block IDs are memory-contiguous, they can be transferred in a single operation. The diagram illustrates an ideal scenario in which the number of NCCL kernel calls per KV Cache transfer is reduced from $O(n)$ to $O(1)$.
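
The bidirectional segment alignment described in the Figure 5 caption can be sketched as follows. This is one possible reading of the caption, not the authors' code: it assumes the sender and receiver each hold an ordered list of block IDs for the same request, and it merges consecutive positions into one segment whenever the IDs advance by one on both sides. The function name `align_segments` and the `(src_start, dst_start, length)` tuple format are hypothetical, and each resulting segment is assumed to map to one transfer call.

```python
def align_segments(src_blocks, dst_blocks):
    """Group paired block IDs into segments contiguous on BOTH sides.

    Returns a list of (src_start, dst_start, length) tuples. Under the
    assumption stated above, each tuple corresponds to one transfer call.
    """
    if len(src_blocks) != len(dst_blocks):
        raise ValueError("sender and receiver must hold the same number of blocks")

    segments = []
    for src, dst in zip(src_blocks, dst_blocks):
        if segments:
            src_start, dst_start, length = segments[-1]
            # Extend the current segment only if both sides advance by one block.
            if src == src_start + length and dst == dst_start + length:
                segments[-1] = (src_start, dst_start, length + 1)
                continue
        segments.append((src, dst, 1))
    return segments


if __name__ == "__main__":
    # Ideal case: both sides fully contiguous, so a single segment results.
    print(align_segments([10, 11, 12, 13], [40, 41, 42, 43]))
    # Sender fragmented into two runs: two segments.
    print(align_segments([10, 11, 20, 21], [40, 41, 42, 43]))
    # Worst case: no adjacent blocks on the sender, one segment per block.
    print(align_segments([5, 7, 9], [1, 2, 3]))
```

In the ideal case the whole request collapses into one segment, matching the $O(1)$ scenario in the figure. When either side is fragmented, the number of segments grows toward one per block, which corresponds to the $O(n)$ baseline.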