ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism

Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, Dingwen Tao

TL;DR

This work tackles inefficiencies in serving multimodal large language models (MLLMs) under heterogeneous and bursty workloads by proposing Elastic Multimodal Parallelism (EMP) and ElasticMM. EMP provides a two-tier scheduling framework that decouples modality groups (text-only vs. multimodal) and inference stages (encoding, prefill, decode) to enable elastic resource reallocation and stage-specific parallelism. ElasticMM implements modality-aware load balancing, elastic partition scheduling, and multimodal inference optimizations (unified multimodal prefix caching and non-blocking encoding) to reduce time-to-first-token (TTFT) and boost throughput while maintaining accuracy. Empirical results on Llama3.2-Vision-11B and Qwen2.5-VL-7B across VisualWebInstruct and ShareGPT-4o show TTFT reductions of up to $4.2\times$ and throughput gains of $3.2$–$4.5\times$ under SLOs, outperforming state-of-the-art baselines. The approach provides a practical and scalable paradigm for efficient, real-time multimodal AI service deployment.

Abstract

Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components -- combined with complex inference pipelines and heterogeneous workloads -- introduce significant inference overhead. Therefore, efficiently serving MLLMs remains a major challenge. Current tightly coupled serving architectures struggle to distinguish between mixed request types or adapt parallelism strategies to different inference stages, leading to increased time-to-first-token (TTFT) latency and poor resource utilization. To address this, we introduce Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity across request types and inference stages. Building upon EMP, we develop ElasticMM, an MLLM serving system that (1) separates requests into independent modality groups with dynamic resource allocation via a modality-aware load balancer; (2) decouples inference stages and enables parallelism adjustment and adaptive scaling via elastic partition scheduling; and (3) improves inference efficiency through unified multimodal prefix caching and non-blocking encoding. Experiments on diverse real-world datasets show that ElasticMM outperforms state-of-the-art (SOTA) serving systems, reducing TTFT by up to 4.2x and achieving 3.2-4.5x higher throughput while meeting service-level objectives (SLOs).
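
To make point (1) of the abstract concrete, the following is a minimal sketch of what a modality-aware split of serving instances could look like. It is not ElasticMM's implementation: the `Request` fields, the fixed `tokens_per_image` cost model, and the proportional allocation rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    image_count: int = 0  # hypothetical field: number of attached images


def modality_group(req: Request) -> str:
    """Assign a request to the text-only or the multimodal group."""
    return "multimodal" if req.image_count > 0 else "text"


def estimated_load(req: Request, tokens_per_image: int = 576) -> int:
    # Assumed cost model: each image contributes a fixed number of visual tokens.
    return req.prompt_tokens + req.image_count * tokens_per_image


def allocate_instances(requests, total_instances):
    """Split instances between the two groups in proportion to estimated load."""
    if total_instances < 2:
        raise ValueError("need at least one instance per group")
    load = {"text": 0, "multimodal": 0}
    for req in requests:
        load[modality_group(req)] += estimated_load(req)
    total = load["text"] + load["multimodal"]
    if total == 0:
        mm = total_instances // 2  # no pending work: split evenly
    else:
        mm = round(total_instances * load["multimodal"] / total)
    # Keep at least one instance in each group so neither group starves.
    mm = max(1, min(total_instances - 1, mm))
    return {"text": total_instances - mm, "multimodal": mm}
```

Re-running `allocate_instances` on each scheduling tick with the currently queued requests would shift instances toward whichever group is under heavier load, which is the elastic behavior the abstract describes at a high level.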

Paper Structure

This paper contains 15 sections, 8 theorems, 20 equations, 8 figures, 2 tables.

Key Result

Theorem 1

For any multimodal model $\mathcal{M}$ with inference function $f_{\mathcal{M}}$, the Elastic Multimodal Parallelism framework produces outputs identical to the standard sequential execution, i.e., $f_{\mathcal{M}}^{EMP}(x) = f_{\mathcal{M}}(x)$ for every input $x$, where $f_{\mathcal{M}}^{EMP}$ represents the inference function under the EMP framework.
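
As an illustration of what this equivalence claim means in practice, the toy sketch below compares a monolithic pipeline with a stage-decoupled one that hands intermediate state between hypothetical workers. The `encode`, `prefill`, and `decode` functions are deterministic stand-ins invented for this example, not the paper's model or proof; deep copies stand in for transferring features and the KV cache.

```python
import copy


def encode(images):
    # Toy "vision encoder": one integer feature per image.
    return [sum(img) % 97 for img in images]


def prefill(text_tokens, image_features):
    # Toy "prefill": build a KV-cache-like list of running hashes,
    # starting from a fixed begin-of-sequence entry.
    cache, acc = [1], 1
    for tok in image_features + text_tokens:
        acc = (acc * 31 + tok) % 1009
        cache.append(acc)
    return cache


def decode(cache, steps):
    # Toy greedy "decode": each new token depends only on the cache so far.
    cache = list(cache)
    out = []
    for _ in range(steps):
        nxt = (cache[-1] * 31 + len(cache)) % 1009
        out.append(nxt)
        cache.append(nxt)
    return out


def run_sequential(text_tokens, images, steps):
    """Standard execution: all stages run back to back in one place."""
    return decode(prefill(text_tokens, encode(images)), steps)


def run_decoupled(text_tokens, images, steps):
    """Stage-decoupled execution: state is copied between stages."""
    features = copy.deepcopy(encode(images))  # encoder worker -> prefill worker
    cache = copy.deepcopy(prefill(text_tokens, features))  # prefill -> decode worker
    return decode(cache, steps)
```

Because every stage is a pure function of its inputs and the transferred state is copied exactly, the two execution paths agree on any input; the theorem asserts the analogous property for the real system.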

Figures (8)

  • Figure 1: MLLMs' inference overhead and workload complexity. (a) and (b) demonstrate the significant overhead introduced by MLLMs. (c) reveals the longer context in multimodal requests. Results obtained using the LLaMA3.2-11B model on the ShareGPT-4o dataset.
  • Figure 2: Framework diagram of ElasticMM. The figure illustrates a two-level scheduling framework that collaboratively enables elastic multimodal parallelism.
  • Figure 3: Illustration of the elastic scheduling space in EMP.
  • Figure 4: Example of elastic auto-scaling with three instances.
  • Figure 5: The average input and output latency of ElasticMM and baseline MLLM serving systems with the Qwen2.5-VL-7B and Llama3.2-Vision-11B under two real-world workloads. ElasticMM consistently demonstrates the lowest latency across all cases.
  • ...and 3 more figures

Theorems & Definitions (16)

  • Theorem 1: Inference Equivalence
  • proof
  • Lemma 1: Modality-Level Equivalence
  • proof
  • Lemma 2: Inference Stage Separation
  • proof
  • Lemma 3: Dynamic Parallelism Invariance
  • proof
  • Lemma 4: KV Cache Migration Fidelity
  • proof
  • ...and 6 more