FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving

Shouwei Gao, Junqi Yin, Feiyi Wang, Wenqian Dong

TL;DR

Flying Serving is a vLLM-based system that enables online DP-TP switching without restarting engine workers. It makes reconfiguration practical by virtualizing the state that would otherwise force data movement.

Abstract

Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool to amortize collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance by up to $4.79\times$ under high load and $3.47\times$ under low load while supporting latency- and memory-driven requests.
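As a rough illustration of what "exposing TP shard views on demand" without copying can mean, the sketch below keeps one weight buffer and hands out slices of it as shards. It is a CPU-only analogy using Python's `memoryview`, not the paper's Model Weights Manager: the function name `shard_views`, the row-wise split, and the tiny 4 x 2 matrix are all assumptions made for the example.

```python
from array import array


def shard_views(weights: memoryview, rows: int, cols: int, tp: int):
    """Return `tp` zero-copy views over a row-major rows x cols matrix.

    Assumption: shards are contiguous groups of rows, so each shard is
    a plain slice of the flat buffer and no data is copied.
    """
    if rows % tp != 0:
        raise ValueError("rows must be divisible by the TP degree")
    step = (rows // tp) * cols  # elements per shard
    return [weights[r * step:(r + 1) * step] for r in range(tp)]


# Hypothetical 4 x 2 weight matrix, stored once in row-major order.
storage = array("f", range(8))
full = memoryview(storage)

dp_view = full                                        # DP: the whole matrix
tp2_views = shard_views(full, rows=4, cols=2, tp=2)   # 2-way TP shards
tp4_views = shard_views(full, rows=4, cols=2, tp=4)   # 4-way TP shards

# A write to the single underlying buffer is visible through every view,
# which shows that the views alias the same memory instead of copying it.
storage[6] = 42.0
print(tp2_views[1].tolist())
print(tp4_views[3].tolist())
```

Because every layout is only a view, switching between DP and TP in this toy model means choosing a different list of slices; the buffer itself is never moved or reallocated.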

Paper Structure

This paper contains 43 sections, 4 equations, 10 figures, 2 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Attention/FFN operators (simplified).
  • Figure 2: Model layouts on a 4-GPU node in DP vs. TP.
  • Figure 3: Overview of the Flying Serving architecture. The system functions as a middleware layer that orchestrates multiple engine workers, enabling dynamic transitions between DP and TP. The timeline on the right illustrates how the system adapts to different request types, such as high-priority, long-context, and latency-strict tasks, by reconfiguring workers from independent DP instances into cooperative TP groups on the fly.
  • Figure 4: Model Weights Manager architecture for zero-copy DP/TP switching.
  • Figure 5: An example of KV cache adaptation in Flying Serving. Because the per-token footprint shrinks with TP (DP: $N\!\times\!D$; 4TP: $N\!\times\!D/4$), we keep a fixed physical layout without reallocation by scaling the number of tokens per block in proportion to the TP degree (that is, inversely to the per-token footprint): 4 tokens (DP), 8 (2TP), 16 (4TP). Block sizes are managed per request by the KV cache adaptor; the physical memory of each block is unchanged.
  • ...and 5 more figures
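
The block-size arithmetic in the Figure 5 caption can be checked with a short sketch. The base of 4 tokens per block and the $N \times D$ per-token footprint come from the caption; the concrete values of `N` and `D`, the function names, and the choice to count the footprint in elements are assumptions made for illustration.

```python
BASE_TOKENS = 4  # tokens per block in DP mode, as in the Figure 5 caption


def per_token_footprint(n: int, d: int, tp: int) -> int:
    """Per-GPU KV footprint of one token, in elements: N * D / TP."""
    if (n * d) % tp != 0:
        raise ValueError("N * D must be divisible by the TP degree")
    return n * d // tp


def tokens_per_block(tp: int) -> int:
    """Tokens held by one block; grows in proportion to the TP degree."""
    return BASE_TOKENS * tp


def block_footprint(n: int, d: int, tp: int) -> int:
    """Physical size of one block, in elements."""
    return tokens_per_block(tp) * per_token_footprint(n, d, tp)


N, D = 32, 128  # hypothetical values, not taken from the paper
for tp in (1, 2, 4):  # DP, 2TP, 4TP
    print(tp, tokens_per_block(tp), per_token_footprint(N, D, tp),
          block_footprint(N, D, tp))
```

The TP factor cancels in the product, so every layout reports the same block footprint; this is why, under these assumptions, a block can be reinterpreted across DP and TP without reallocating memory.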