PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving

Sunghyeon Woo, Hoseung Kim, Sunghwan Shim, Minjung Jo, Hyunjoon Jeong, Jeongtae Lee, Joonghoon Kim, Sungjae Lee, Baeseong Park, Se Jung Kwon, Dongsoo Lee

TL;DR

Multi-model agent workflows incur redundant prefill computation and per-model KV caches, increasing latency and memory usage. PrefillShare decouples a shared base prefill module from task-specific decoders and trains the decoders to condition on the base cache via cache-conditioned fine-tuning, enabling cross-model prefix KV reuse without sacrificing accuracy. The approach is combined with prefix-locality routing and a disaggregated inference workflow in a vLLM-based system to support heterogeneous models. Empirical results show near Full-FT accuracy across tasks and model scales, with up to 4.5x lower p95 latency and up to 3.9x higher throughput in multi-model workloads, demonstrating scalable, efficient, and robust sharing of prefill computation for agentic LLM serving.
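The summary above does not spell out the routing policy. As a minimal sketch only, assuming that prefix-locality routing means sending requests with the same prompt prefix to the same prefill worker, the idea could be illustrated as follows. The function name `route_by_prefix`, the character-based prefix length, and the hashing scheme are hypothetical and are not taken from the paper or from vLLM.

```python
import hashlib


def route_by_prefix(prompt: str, num_prefill_workers: int, prefix_chars: int = 256) -> int:
    """Pick a prefill worker index from a hash of the prompt's leading characters.

    Hypothetical policy (an assumption, not the paper's method): requests whose
    first `prefix_chars` characters match are sent to the same worker, so that
    worker's prefix cache can be reused regardless of which decoder is targeted.
    """
    if num_prefill_workers <= 0:
        raise ValueError("num_prefill_workers must be positive")
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer hash.
    return int.from_bytes(digest[:8], "big") % num_prefill_workers


# Two requests that share a long context but end differently map to the same worker.
shared_context = "system: you are a tool-using agent. " * 10
worker_a = route_by_prefix(shared_context + "task: plan", num_prefill_workers=4)
worker_b = route_by_prefix(shared_context + "task: critique", num_prefill_workers=4)
```

A real router would more likely match on token blocks and account for worker load; this sketch only shows why a stable hash of the prefix keeps related requests together.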

Abstract

Multi-agent systems increasingly orchestrate multiple specialized language models to solve complex real-world problems, often invoking them over a shared context. This execution pattern repeatedly processes the same prompt prefix across models. Consequently, each model redundantly executes the prefill stage and maintains its own key-value (KV) cache, increasing aggregate prefill load and worsening tail latency by intensifying prefill-decode interference in existing LLM serving stacks. Disaggregated serving reduces such interference by placing prefill and decode on separate GPUs, but disaggregation does not fundamentally eliminate inter-model redundancy in computation and KV storage for the same prompt. To address this issue, we propose PrefillShare, a novel algorithm that enables sharing the prefill stage across multiple models in a disaggregated setting. PrefillShare factorizes the model into prefill and decode modules, freezes the prefill module, and fine-tunes only the decode module. This design allows multiple task-specific models to share a prefill module and the KV cache generated for the same prompt. We further introduce a routing mechanism that enables effective prefill sharing across heterogeneous models in a vLLM-based disaggregated system. PrefillShare not only matches full fine-tuning accuracy on a broad range of tasks and models, but also delivers 4.5x lower p95 latency and 3.9x higher throughput in multi-model agent workloads.
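To make the redundancy argument concrete, the following toy accounting counts how many prefill passes are needed when several task-specific models are invoked on the same prompt. It is a sketch under simplifying assumptions that are not from the paper: the cache is unbounded, cache entries are keyed by the whole prompt rather than by token-level prefixes, and the name `count_prefills` is hypothetical.

```python
from typing import List, Set, Tuple, Union


def count_prefills(requests: List[Tuple[str, str]], share_prefill: bool) -> int:
    """Count prefill executions for a list of (model_name, prompt) requests.

    Without sharing, each model keeps its own KV cache, so the cache key is
    (model_name, prompt). With a shared prefill module, the key is the prompt
    alone, because every decoder consumes the same base KV cache.
    """
    cache: Set[Union[str, Tuple[str, str]]] = set()
    prefills = 0
    for model_name, prompt in requests:
        key = prompt if share_prefill else (model_name, prompt)
        if key not in cache:
            cache.add(key)  # cache miss: prefill runs and its KV cache is stored
            prefills += 1
    return prefills


# Three hypothetical task-specific models invoked over one shared context.
requests = [
    ("planner", "shared context"),
    ("coder", "shared context"),
    ("critic", "shared context"),
]
baseline_prefills = count_prefills(requests, share_prefill=False)  # 3
shared_prefills = count_prefills(requests, share_prefill=True)     # 1
```

In this toy setting the per-model baseline runs prefill once per model, while the shared variant runs it once per prompt, which is the redundancy the abstract describes.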

Paper Structure (35 sections, 9 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Comparison of a typical multi-model system and PrefillShare. (Left) In a typical multi-model system, each task-specific model is fine-tuned and served independently, maintaining its own KV cache for identical input prompts. PrefillShare decouples prefill and decoding into a shared prefill module and task-specific decode modules, fine-tuning only the decode modules, thereby enabling multiple task-specific decoders to reuse the same prompt KV cache. (Right) In the multi-model disaggregated serving baseline, substantial computational overhead arises from redundant prefill on identical prompts and from frequent recomputation caused by rapid KV-cache growth and the resulting evictions under increasing load. PrefillShare shares KV caches across models to enable effective prefix caching and prevent early cache saturation, substantially reducing computation and achieving up to $4\times$ higher throughput under high load.
  • Figure 2: GSM8K accuracy as a function of KV cache sharing ratio between the base and fine-tuned models. Naive sharing without cache-adaptive fine-tuning collapses at high sharing ratios, while PrefillShare preserves near Full-FT accuracy even at 100% sharing.
  • Figure 3: Serving performance under multi-model agent workloads. We compare the disaggregated baseline and PrefillShare under increasing session arrival rates for two representative agentic patterns: ReAct (top) and Reflexion (bottom). We report p95 end-to-end latency, throughput, and TTFT. PrefillShare outperforms the baseline in both patterns, with the gap widening as the session arrival rate increases, primarily due to rising prefix-cache miss rates in the baseline.
  • Figure 4: Prefix cache hit ratio and throughput under varying max concurrent sessions. Top: prefix cache hit ratio (%). Bottom: throughput (tok/s). The baseline degrades beyond $\approx$40 sessions as reduced prefix reuse lowers both metrics, whereas PrefillShare sustains higher throughput over a wider range, with high-concurrency saturation driven by handoff overheads.
  • Figure 5: Serving performance under multi-model agent workloads using Qwen3-14B. We replicate the experimental setup of Fig. 3, replacing the LLaMA3.1-8B backbone with Qwen3-14B while keeping all other settings identical. Results are shown for two representative agentic patterns, ReAct (top) and Reflexion (bottom), reporting p95 end-to-end latency, throughput, and TTFT.
  • ...and 1 more figures