ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving

Haoran Qiu; Anish Biswas; Zihan Zhao; Jayashree Mohan; Alind Khare; Esha Choukse; Íñigo Goiri; Zeyu Zhang; Haiying Shen; Chetan Bansal; Ramachandran Ramjee; Rodrigo Fonseca

TL;DR

The paper analyzes production-scale serving of large multimodal models (LMMs) and identifies image encoding as a central bottleneck and bursty, modality-driven traffic as a key challenge. It introduces ModServe, a modular serving framework that disaggregates image preprocessing and encoding from LLM prefill and decode, supported by offline profiling, stage-specific autoscaling, and modality-aware routing. On a 128-GPU cluster driven by production traces, ModServe achieves 3.3–5.5x higher throughput and 25–41.3% cost savings over monolithic baselines, with strong improvements under bursty, image-heavy traffic, and it remains compatible with prefill-decode (PD) disaggregation. The work offers a practical, production-oriented approach to scalable LMM serving, grounded in an extensive characterization of open-source LMMs and real production traces, and provides a foundation for extending to other multimodal scenarios and architectures.

Abstract

Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text. However, efficiently serving LMMs in production environments poses significant challenges due to their complex architectures and heterogeneous characteristics across their multi-stage inference pipelines. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models, revealing key systems design implications. We also present an in-depth analysis of production LMM inference traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions and bursty traffic patterns. Based on these insights, we propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling. ModServe dynamically reconfigures stages and handles bursty traffic with modality-aware scheduling and autoscaling to meet tail latency SLOs while minimizing costs. ModServe achieves 3.3-5.5x higher throughput (leading to 25-41.3% cost saving) while meeting SLOs on a 128-GPU cluster with production traces.
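To make the idea of decoupled stages with modality-aware scheduling concrete, the following is a minimal sketch under stated assumptions, not ModServe's actual implementation. The names `Request`, `StagePool`, and `route` are hypothetical, and the least-loaded placement policy is an assumption chosen for illustration. It shows only the core routing idea: text-only requests bypass the image-encoding pool, so each pool can be sized independently.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    num_images: int      # 0 means a text-only request
    prompt_tokens: int


@dataclass
class StagePool:
    """Hypothetical pool of instances serving one pipeline stage."""
    name: str
    load: list  # pending work units per instance

    def assign(self, work: int) -> int:
        # Assumed policy: place work on the least-loaded instance
        # (ties go to the lowest index).
        idx = min(range(len(self.load)), key=self.load.__getitem__)
        self.load[idx] += work
        return idx


def route(req: Request, encoder_pool: StagePool, llm_pool: StagePool) -> list:
    """Return the (stage name, instance index) hops for one request."""
    plan = []
    if req.num_images > 0:
        # Only requests that carry images visit the image-encoding stage.
        plan.append((encoder_pool.name, encoder_pool.assign(req.num_images)))
    # Every request ends at the LLM stage (prefill and decode).
    plan.append((llm_pool.name, llm_pool.assign(req.prompt_tokens)))
    return plan


if __name__ == "__main__":
    encoders = StagePool("encode", [0, 0])
    llms = StagePool("llm", [0, 0, 0])
    for r in (Request(1, 0, 100), Request(2, 3, 50), Request(3, 1, 10)):
        print(r.req_id, route(r, encoders, llms))
```

Because the two pools keep separate load state, a burst of image-heavy requests raises load only on the encoding pool, which is the property that lets a stage-specific autoscaler add encoder instances without touching the LLM pool.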
