MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang

TL;DR

MuxServe tackles the inefficiency of serving multiple LLMs with uneven demand by introducing a flexible spatial-temporal multiplexing framework. It formalizes the multiplexing problem, pairs an enumeration-based placement algorithm with adaptive batch scheduling, and uses a unified resource manager to manage KV cache, weight, and activation memory for colocated LLMs. Results on synthetic and real workloads show up to 1.8x higher throughput, or 2.9x more requests processed at 99% SLO attainment, with the largest gains when LLM popularity diverges. The approach offers a practical path to scalable, cost-efficient multi-LLM endpoints in production settings.

Abstract

Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches because the popularity of LLMs varies. In this paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight is to colocate LLMs according to their popularity to multiplex memory resources, and to leverage the characteristics of the prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally formulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. MuxServe designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that MuxServe achieves up to $1.8\times$ higher throughput or processes $2.9\times$ more requests within $99\%$ SLO attainment. The code is available at: \url{https://github.com/hao-ai-lab/MuxServe}.
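
To make the idea of popularity-aware colocation concrete, here is a toy sketch of an enumeration-style placement search. It is not MuxServe's actual algorithm: it assumes a fixed number of equally sized GPU units, uses the peak per-unit request rate as a stand-in for a real throughput objective, and only checks that model weights fit in memory. The function name `enumerate_placement`, the model names, the sizes, and the request rates are all hypothetical.

```python
from itertools import product


def enumerate_placement(models, num_units, unit_mem_gb):
    """Brute-force search over all model-to-unit assignments.

    models: list of (name, weight_gb, request_rate) tuples (hypothetical data).
    Returns (placement, peak_rate) for the feasible assignment with the
    smallest peak per-unit request rate, or (None, None) if nothing fits.
    """
    best, best_peak = None, float("inf")
    for assignment in product(range(num_units), repeat=len(models)):
        mem = [0.0] * num_units
        rate = [0.0] * num_units
        for (_, weight_gb, req_rate), unit in zip(models, assignment):
            mem[unit] += weight_gb
            rate[unit] += req_rate
        if max(mem) > unit_mem_gb:
            continue  # the colocated weights do not fit on this unit
        peak = max(rate)
        if peak < best_peak:  # strict '<' keeps the first best assignment
            best, best_peak = assignment, peak
    if best is None:
        return None, None
    placement = {unit: [] for unit in range(num_units)}
    for (name, _, _), unit in zip(models, best):
        placement[unit].append(name)
    return placement, best_peak


# Hypothetical workload: two popular and two unpopular models.
models = [
    ("llm-a", 14.0, 10.0),
    ("llm-b", 14.0, 1.0),
    ("llm-c", 26.0, 2.0),
    ("llm-d", 26.0, 8.0),
]
placement, peak = enumerate_placement(models, num_units=2, unit_mem_gb=80.0)
print(placement, peak)
```

In this toy setting the search spreads the two popular models across different units and fills the remaining capacity with unpopular ones. The exhaustive loop grows as `num_units ** len(models)`, so it is only meant to illustrate the search space, not to scale.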

Paper Structure (21 sections, 3 equations, 12 figures, 1 table, 3 algorithms)

This paper contains 21 sections, 3 equations, 12 figures, 1 table, 3 algorithms.

Figures (12)

  • Figure 1: Three multiplexing strategies and GPU utilization of serving two LLMs on two GPUs.
  • Figure 2: The dynamic request arrival rates of different LLMs over a 20 day period.
  • Figure 3: Relative batch inference latency as the fraction of computing resources assigned to LLaMA-7B changes from $30\%$ to $100\%$. The input sequence length is 128.
  • Figure 4: Overview of GPU resource management in an LLM unit. The memory is divided into 3 partitions to store KV cache, weights and activations, respectively. The parallel runtime partitions SM dynamically to different jobs.
  • Figure 5: Throughput and SLO attainment on synthetic workloads.
  • ...and 7 more figures