ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving

Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, Rodrigo Fonseca

TL;DR

The paper analyzes production-scale serving of large multimodal models (LMMs), identifying image encoding as a central bottleneck and bursty, modality-driven traffic as a key challenge. It introduces ModServe, a modular, stage-disaggregated serving framework that separates image preprocessing and encoding from LLM prefill and decode, supported by offline profiling, stage-specific autoscaling, and modality-aware routing. On a 128-GPU cluster driven by production traces, ModServe achieves 3.3–5.5x higher throughput and 25–41.3% cost savings over monolithic baselines while meeting SLOs, with strong improvements under bursty, image-heavy traffic, and it remains compatible with prefill-decode (PD) disaggregation. The work provides a practical, production-oriented approach to scalable LMM serving, backed by an extensive characterization of open-source LMMs and real production traces, and offers a foundation for extending to other multimodal scenarios and architectures.

Abstract

Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text. However, efficiently serving LMMs in production environments poses significant challenges due to their complex architectures and heterogeneous characteristics across their multi-stage inference pipelines. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models, revealing key systems design implications. We also present an in-depth analysis of production LMM inference traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions and bursty traffic patterns. Based on these insights, we propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling. ModServe dynamically reconfigures stages and handles bursty traffic with modality-aware scheduling and autoscaling to meet tail latency SLOs while minimizing costs. ModServe achieves 3.3-5.5x higher throughput (leading to 25-41.3% cost saving) while meeting SLOs on a 128-GPU cluster with production traces.
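
The idea of decoupled stages with modality-aware scheduling can be illustrated with a minimal sketch. This is not ModServe's implementation: the `Instance`, `Request`, `least_loaded`, and `route` names, the token-count load metric, and the least-loaded policy are all assumptions chosen for illustration. The sketch only shows the general shape of the technique: text-only requests skip the image-encoder pool entirely, while image requests are first assigned to an encoder instance and then to an LLM instance.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instance:
    """A serving instance in one stage pool (hypothetical model)."""
    name: str
    pending_tokens: int = 0  # assumed load metric: queued tokens


@dataclass
class Request:
    text_tokens: int
    image_tokens: int = 0  # 0 means a text-only request


def least_loaded(pool: List[Instance]) -> Instance:
    # Pick the instance with the least queued work (ties go to the first).
    return min(pool, key=lambda inst: inst.pending_tokens)


def route(
    req: Request, encoders: List[Instance], llms: List[Instance]
) -> Tuple[Optional[Instance], Instance]:
    """Return the (encoder or None, LLM) instances chosen for a request."""
    encoder = None
    if req.image_tokens > 0:
        # Only image requests consume image-encoder capacity.
        encoder = least_loaded(encoders)
        encoder.pending_tokens += req.image_tokens
    llm = least_loaded(llms)
    # Assumption: image tokens also add to LLM prefill work, as they
    # would when a decoder-only LMM places them in the input sequence.
    llm.pending_tokens += req.text_tokens + req.image_tokens
    return encoder, llm
```

Because the two pools are tracked separately, a burst of image-heavy requests raises load only on the encoder pool and on the LLM instances that receive those requests, which is the property that lets each stage be scaled independently.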

Paper Structure

This paper contains 18 sections, 20 figures, 1 table.

Figures (20)

  • Figure 1: Impact of image/video workload on LMM inference TTFT for state-of-the-art implementation of Llama3.2-11B on vLLM vs. ModServe with an 8-A100 GPU server. The "Monolith" setup deploys the full model using 8 GPUs while the "Decoupled" setup deploys the LLM backend on 4 GPUs and four image encoders on the other 4 GPUs.
  • Figure 2: Model architecture for decoder-only and cross-attention-based LMMs in Image-Text-to-Text tasks [ittt].
  • Figure 3: Distribution of image token count (per request) for open-source LMMs on the ShareGPT-4o dataset [chen2024far]. Different LMMs (e.g., LLaVA-OV 7B and 72B) can share the same image encoder, so the number of image tokens is the same.
  • Figure 4: Image dimension distribution and text prompt length distribution of the ShareGPT-4o Image dataset [chen2024far].
  • Figure 5: Per-stage request latency breakdown analysis across representative open-source LMMs deployed using default tensor parallelism (TP) as described in Table 1. TTFT (dashed line) is the sum of the latency from each inference stage.
  • ...and 15 more figures