Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving

Zongze Li, Jingyu Liu, Zach Xu, Yineng Zhang, Tahseen Rabbani, Ce Zhang

Abstract

Prefill-Decode (PD) disaggregation has become the standard architecture for modern LLM inference engines because it alleviates interference between two distinct workloads. With the growing demand for multi-turn interactions in chatbots and agentic systems, we re-examine PD in this setting and find two fundamental inefficiencies: (1) every turn requires prefilling the new prompt together with the response from the previous turn, and (2) repeated KV transfers between prefill and decode nodes saturate the bandwidth, leading to high latency and even service degradation. Our key insight is that not all prefill operations are equally disruptive: append-prefill, which processes only the new input tokens while reusing cached KV states, incurs substantially less decoding slowdown than full prefill. This motivates routing append-prefill to decode nodes locally. However, through comprehensive analysis, we show that no single fixed routing strategy satisfies all Service Level Objectives (SLOs) simultaneously. Based on this insight, we propose Prefill Prefill-capable Decode (PPD) disaggregation, a dynamic routing system that decides when to process Turn 2+ requests locally on decode nodes using cached KV states. PPD adapts to varying SLOs via configurable weights and integrates seamlessly with traditional PD deployments. In extensive evaluations, PPD reduces Turn 2+ time-to-first-token (TTFT) by 68% while maintaining competitive time-per-output-token (TPOT), effectively alleviating KV transfer congestion under high load. We believe PPD represents a flexible and efficient paradigm for multi-turn LLM serving.

Paper Structure (48 sections, 2 equations, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: P99 TTFT vs. Tokens-per-Second (TPS) Pareto frontiers under a long-context workload (10k input, 100 output tokens, 5 turns) at three load levels. Higher TPS and lower TTFT are better (upper-left is ideal). Baselines (orange): PD and Replica configurations where Turn 2+ requests always require P-node processing. D-local capable (blue): configurations allowing decode nodes to process Turn 2+ locally via append-prefill. The "Best" annotation marks the configuration selected by our dynamic PPD routing system, validating correct trade-off optimization. See Figure \ref{fig:pareto-small-context} in the Appendix for small-context results showing consistent trends across different workloads.
  • Figure 2: Prefill-decode interference. Decode TPOT degradation when co-locating with one full-prefill vs. one append-prefill operation (both processing 1,024 tokens). Full prefill causes a significant slowdown; append prefill remains close to baseline. See Figure \ref{fig:interference-4prefills} in the Appendix for 4-prefill experiments showing consistent trends.
  • Figure 3: Dynamic routing of append-prefill with PPD. PPD (Right) dynamically routes the append-prefill operations of multi-turn conversations to either prefill or decode nodes based on the user SLO, the estimated system workload, and the initial node configuration. Compared to Replica (Left), PPD retains all the benefits of disaggregation. In contrast to PD (Middle), PPD uses the local cache to avoid heavy KV cache transfers and extra recomputation for append-prefill, and it can adjust its routing to meet different serving requirements. Both PD and any strategy that routes a fixed $x\%$ of append-prefill operations to decode nodes are special cases of PPD.
  • Figure 4: PPD improves stability and reduces latency. Average query latency vs. QPS for three configurations (1P_3D, 2P_2D, 3P_1D) on the ShareGPT (blue) and WildChat (orange) datasets. Dashed lines: PD ($x{=}0$); solid lines: PPD ($x \in [0,1]$). $\times$ markers indicate service degradation (success rate below 95% due to request timeouts). PPD consistently achieves lower latency than PD while maintaining stability across the entire QPS range.
  • Figure 5: Weight-based TTFT-TPOT trade-off. Turn 2+ latency for 1P_3D on prefill-heavy workloads. The gray band shows the achievable trade-off frontier. Markers indicate D-local routing ratios: 0% (Static PD baseline), 20% ($w_{\text{tpot}}{=}6$), 50% ($w_{\text{tpot}}{=}3$), and 95% (balanced, $w_{\text{tpot}}{=}1$). Higher $w_{\text{tpot}}$ penalizes TPOT degradation, routing fewer requests to D locally.
  • ...and 4 more figures
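
The weight-based trade-off described in the Figure 5 caption can be illustrated with a small sketch. This is not the paper's routing algorithm, which is not reproduced on this page. It assumes a simple linear cost that weights an estimated TTFT against an estimated TPOT penalty. The names `RouteEstimate`, `route_cost`, and `choose_route`, the cost form, and all latency numbers are hypothetical and chosen only to show the direction of the effect: a larger `w_tpot` sends fewer Turn 2+ requests to decode nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteEstimate:
    """Hypothetical latency estimates (in ms) for one routing option."""
    ttft_ms: float          # estimated time-to-first-token for the request
    tpot_penalty_ms: float  # estimated TPOT slowdown imposed on co-located decodes


def route_cost(est: RouteEstimate, w_ttft: float, w_tpot: float) -> float:
    """Assumed linear cost: weighted TTFT plus weighted TPOT penalty."""
    return w_ttft * est.ttft_ms + w_tpot * est.tpot_penalty_ms


def choose_route(p_node: RouteEstimate, d_local: RouteEstimate,
                 w_ttft: float = 1.0, w_tpot: float = 1.0) -> str:
    """Return "d_local" if append-prefill on the decode node is cheaper, else "p_node"."""
    cost_p = route_cost(p_node, w_ttft, w_tpot)
    cost_d = route_cost(d_local, w_ttft, w_tpot)
    return "d_local" if cost_d < cost_p else "p_node"


# Illustrative numbers only (not measurements from the paper):
# the P-node path pays for KV transfer in TTFT but does not disturb decoding,
# while the D-local path starts sooner but slows co-located decodes slightly.
p_node = RouteEstimate(ttft_ms=300.0, tpot_penalty_ms=0.0)
d_local = RouteEstimate(ttft_ms=100.0, tpot_penalty_ms=50.0)

for w_tpot in (1.0, 3.0, 6.0):
    print(w_tpot, choose_route(p_node, d_local, w_tpot=w_tpot))
```

With these made-up estimates, the D-local option wins at `w_tpot` of 1 and 3 and loses at 6, matching the qualitative statement in the caption that a higher `w_tpot` penalizes TPOT degradation and routes fewer requests to D locally.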