Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

Donglin Yu

Abstract

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{\text{ctx}})$ bytes (GB-scale KV caches under stage-level disaggregation) to $O(N_v \cdot d)$ bytes (MB-scale embeddings), an $O(L)$ reduction, where $L$ is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicted savings 31.4%; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4×A100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster (\$38k) improves Tokens/\$ by 37% over a homogeneous baseline (\$64k) without degrading latency.

Paper Structure (31 sections, 1 theorem, 9 equations, 3 figures, 8 tables)

Key Result

Theorem 1

For any transformer-based MLLM with $L$ layers, $n_{\text{kv}}$ KV heads of dimension $d_h$, hidden dimension $d = n_h \cdot d_h$, visual token count $N_v$, and text context $s_{\text{text}}$, under standard KV caching semantics (i.e., no activation recomputation, KV offloading, or speculative decoding) [...] where $s_{\text{ctx}} = N_v + s_{\text{text}}$. Under MHA ($n_{\text{kv}} = n_h$), this simplifies [...]
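
The quantities in Theorem 1 can be made concrete with a short calculation. The sketch below is illustrative only. The theorem's formula is truncated above, so the sketch assumes the usual KV-cache accounting (one K and one V tensor per layer, each of shape $s_{\text{ctx}} \times n_{\text{kv}} d_h$) and FP16 storage. The function names and the 64-token text context are hypothetical. The model dimensions are those commonly reported for LLaVA-1.5-7B and are likewise assumptions.

```python
def kv_cache_bytes(num_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """Bytes moved when shipping the full KV cache (stage-level split).

    Assumes one K and one V tensor per layer, each holding
    ctx_tokens * n_kv_heads * head_dim elements.
    """
    return 2 * num_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


def embedding_bytes(num_visual_tokens, hidden_dim, bytes_per_elem=2):
    """Bytes moved when shipping only visual embeddings (modality-level split)."""
    return num_visual_tokens * hidden_dim * bytes_per_elem


def transfer_ratio(num_layers, n_kv_heads, head_dim, hidden_dim,
                   num_visual_tokens, text_tokens):
    """KV-cache bytes divided by embedding bytes for one request."""
    ctx_tokens = num_visual_tokens + text_tokens  # s_ctx = N_v + s_text
    kv = kv_cache_bytes(num_layers, n_kv_heads, head_dim, ctx_tokens)
    emb = embedding_bytes(num_visual_tokens, hidden_dim)
    return kv / emb


# Assumed LLaVA-1.5-7B-like configuration (MHA, so n_kv == n_h).
L, N_KV, D_H, D, N_V = 32, 32, 128, 4096, 576
S_TEXT = 64  # hypothetical prompt length

emb_mib = embedding_bytes(N_V, D) / 2**20                    # 4.5 MiB
kv_mib = kv_cache_bytes(L, N_KV, D_H, N_V + S_TEXT) / 2**20  # 320.0 MiB
ratio = transfer_ratio(L, N_KV, D_H, D, N_V, S_TEXT)         # 2 * L * s_ctx / N_v under MHA
```

Under these assumptions the embedding payload is 4.5 MiB, in line with the roughly 4.5 MB quoted in the Figure 2 caption, while the per-request KV cache is about 71 times larger. Because the ratio scales linearly with the layer count, doubling `L` doubles it.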

Figures (3)

  • Figure 1: (a) Cost saving $\Delta_{\text{cost}}$ (Eq. \ref{eq:saving}) as a function of the vision-to-language time ratio $\rho$ for different price ratios $\gamma$. The RTX 4090/A100 operating point ($\gamma = 0.19$, $\rho = 0.63$) is marked. (b) Transfer ratio $R$ (Eq. \ref{eq:ratio}) across model depths, confirming that modality-level disaggregation becomes increasingly advantageous for larger models.
  • Figure 2: HeteroServe architecture. Consumer GPUs (RTX 4090) handle vision encoding and transfer lightweight visual embeddings (${\sim}4.5$ MB) via PCIe to datacenter GPUs (A100), which perform language generation. When the consumer pool is idle, cross-type work stealing allows consumer GPUs to assist with language decoding using pre-loaded LLM weights.
  • Figure 3: Timeline of consumer GPU activity. After completing vision encoding, the consumer GPU transfers embeddings and then steals language generation tasks until new vision requests arrive. Pre-loaded LLM weights enable sub-100 ms role switching.
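
The consumer-GPU behavior described in the Figure 2 and Figure 3 captions (vision work first, stealing language-generation work only when idle) can be sketched as a simple priority rule. This is a minimal illustration, not HeteroServe's scheduler: the function name, the queue representation, and the task labels are all hypothetical, and real role switching would also involve the pre-loaded LLM weights mentioned in the captions.

```python
from collections import deque


def next_task_for_consumer_gpu(vision_queue, decode_queue):
    """Pick the next task for a consumer GPU.

    Hypothetical policy: pending vision-encoding requests always win;
    language-decoding work is stolen only when no vision work is queued.
    Returns a (kind, item) tuple, or None if both queues are empty.
    """
    if vision_queue:
        return ("vision", vision_queue.popleft())
    if decode_queue:
        return ("decode", decode_queue.popleft())
    return None


# Example: one image waiting, two sequences awaiting decoding.
vision_queue = deque(["img-0"])
decode_queue = deque(["seq-0", "seq-1"])

schedule = []
while True:
    task = next_task_for_consumer_gpu(vision_queue, decode_queue)
    if task is None:
        break
    schedule.append(task)
```

With these inputs the GPU encodes `img-0` first and then steals `seq-0` and `seq-1`. A vision request arriving between steals would be served ahead of the remaining decode work, which matches the "until new vision requests arrive" behavior in the Figure 3 caption.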

Theorems & Definitions (2)

  • Theorem 1: Transfer Optimality of the Modality Boundary
  • proof