SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Sunghyeon Woo, Ahreum Seo, Jaegwang Lee, Jaeeun Kil, Hanbae Seo, Joonghoon Kim, Baeseong Park, Se Jung Kwon, Dongsoo Lee

TL;DR

SUN is the first approach to enable cross-model sharing of decode execution in disaggregated multi-LLM serving. It achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers.

Abstract

In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN inherently enables and facilitates low-bit decoding; with Quantized SUN (QSUN), it achieves a 45% speedup with comparable accuracy to SUN while preserving the benefits of shared decoding.
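To make the contrast between per-model decode partitioning and a model-agnostic decode routing policy concrete, the following is a minimal sketch. The abstract does not specify the routing rule, so the least-loaded policy used here is an assumption, and all names (`DecodeWorker`, `route_partitioned`, `route_shared`, `simulate`) are hypothetical, as is the request mix.

```python
from dataclasses import dataclass


@dataclass
class DecodeWorker:
    worker_id: int
    active: int = 0  # number of in-flight decode requests on this worker


def route_partitioned(model_id, workers):
    """Conventional disaggregation: model i may only use its own decode worker i."""
    worker = workers[model_id]
    worker.active += 1
    return worker.worker_id


def route_shared(workers):
    """Model-agnostic routing (assumed policy): pick the least-loaded worker.

    This is only valid when every worker hosts the same frozen decode module,
    so the originating fine-tuned model does not constrain the choice.
    Ties are broken by the lowest worker id.
    """
    worker = min(workers, key=lambda w: (w.active, w.worker_id))
    worker.active += 1
    return worker.worker_id


def simulate(requests, num_workers, shared):
    """Route a list of model ids and return the per-worker load."""
    workers = [DecodeWorker(i) for i in range(num_workers)]
    for model_id in requests:
        if shared:
            route_shared(workers)
        else:
            route_partitioned(model_id, workers)
    return [w.active for w in workers]


# Hypothetical skewed workload: model 0 receives most of the traffic.
requests = [0] * 9 + [1] + [2] + [3]
partitioned_load = simulate(requests, num_workers=4, shared=False)
shared_load = simulate(requests, num_workers=4, shared=True)
print("partitioned:", partitioned_load)
print("shared:     ", shared_load)
```

In this toy setting the partitioned layout leaves three workers nearly idle while one is overloaded, whereas the shared layout spreads the same requests evenly. The sketch ignores KV cache transfer, batching, and request completion, which a real serving system would need to model.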

Paper Structure

This paper contains 51 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of SUN for disaggregated multi-LLM serving. (Left): In conventional disaggregated serving, decode workers are isolated per model, causing persistent GPU underutilization, especially for memory-bound decode execution under skewed multi-model workloads. (Right): SUN enables sharing a frozen decode module across different fine-tuned models by fine-tuning task-specific prefill modules. This design improves GPU utilization and reduces total cost of ownership (TCO).
  • Figure 2: Impact of KV cache reuse strategies on GSM8K accuracy. Naive reuse of fine-tuned prefill caches with the base model degrades accuracy significantly. In contrast, SUN matches full fine-tuning accuracy for LLaMA3.1-8B and Qwen3-8B-Base.
  • Figure 3: Throughput--interactivity trade-off under decode-GPU consolidation. Comparison of a per-model partitioned baseline (4$\times$(1P/1D)) with SUN, which uses task-specific prefill workers (4P) and shares a variable number of decode workers (4D/3D/2D/1D), under uniform ($\alpha{=}0$) and skewed ($\alpha{=}1.5$) request distributions and varying output sequence lengths (OSL). Bars indicate throughput per GPU and lines indicate interactivity ($=1/\mathrm{TPOT}$); the table reports total system throughput (TPUT). Near-constant total throughput is maintained when consolidating decode GPUs down to 2D, revealing a controllable throughput-latency trade-off under skewed workloads.
  • Figure 4: Effect of workload skew on interactivity and throughput. Baseline (blue) partitions resources per model, where each of the four models uses one dedicated prefill GPU and one dedicated decode GPU (1P/1D $\times$ 4). SUN (orange) assigns four task-specific prefill GPUs and shares four decode GPUs across models (4P/4D). As the Zipf skew $\alpha$ increases, SUN maintains stable throughput and interactivity, demonstrating robustness to request imbalance.
  • Figure 5: Achieved/offered request-rate ratio (ISL=1024, OSL=2048) across Zipf skew $\alpha$ and offered total RPS. Ratios close to 1 indicate that the system sustains the injected load; lower ratios indicate overload/backlog. At the operating point used in the main skew sweep (offered total RPS$=2$), the ratio is generally below 1 and degrades with increasing $\alpha$, indicating incipient backlog under skew; SUN remains consistently closer to 1 than the baseline.
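The quantities used in the Figure 3 to Figure 5 captions can be sketched in a few lines. This is an illustration under stated assumptions: the Zipf parameterization (share of the model at rank $k$ proportional to $1/k^{\alpha}$) is a common convention that the captions do not confirm, and the helper names and sample inputs are hypothetical rather than taken from the paper.

```python
def zipf_shares(num_models, alpha):
    """Request share per model under an assumed Zipf law: p_k proportional to 1 / k**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_models + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def interactivity(tpot_seconds):
    """Interactivity as defined in the Figure 3 caption: 1 / TPOT."""
    return 1.0 / tpot_seconds


def throughput_per_gpu(total_throughput, num_prefill_gpus, num_decode_gpus):
    """Total system throughput divided by all GPUs (prefill plus decode)."""
    return total_throughput / (num_prefill_gpus + num_decode_gpus)


def sustained_ratio(achieved_rps, offered_rps):
    """Achieved/offered request-rate ratio from Figure 5; values below 1 mean backlog."""
    return achieved_rps / offered_rps


uniform = zipf_shares(4, alpha=0.0)  # alpha = 0 reduces to equal shares
skewed = zipf_shares(4, alpha=1.5)   # the most popular model dominates
print("uniform shares:", uniform)
print("skewed shares: ", [round(p, 3) for p in skewed])

# Hypothetical placeholder: if total throughput stayed constant, moving from
# 4P/4D to 4P/2D would raise per-GPU throughput simply because fewer GPUs are used.
same_total = 800.0  # placeholder value, not a measured result
print(throughput_per_gpu(same_total, 4, 4), throughput_per_gpu(same_total, 4, 2))
```

Under this parameterization, $\alpha = 0$ gives each of the four models a quarter of the requests, while $\alpha = 1.5$ sends more than half of them to a single model, which is the regime where per-model decode partitioning leaves the other decode GPUs underused.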