MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong, Jinyang Guo, Xianglong Liu, Jun Zhang

TL;DR

MoDES tackles the high inference cost of MoE multimodal LLMs by introducing a training-free framework that adaptively skips experts. It couples a Globally-Modulated Local Gating mechanism with a Dual-Modality Thresholding policy to respect layer-wise contribution and modality-specific token behavior, guided by an efficient frontier search to set thresholds. The method yields substantial speedups and preserves almost full accuracy across 13 benchmarks, outperforming unimodal-expert-skipping baselines especially at high skip ratios. This advances practical deployment of MoE MLLMs by enabling scalable, accurate, and hardware-friendly inference. The approach is compatible with quantization and generalizes across backbones and model series.

Abstract

Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they are computationally expensive at inference. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and the modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method, which processes tokens from each modality separately, is then applied to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments on 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of the experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
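The skipping rule described above can be illustrated with a minimal sketch for one token at one MoE layer. It borrows the symbols from the Figure 4 caption ($\alpha^{(l)}$, $\pi^{(l)}_i$, $\tau_t$, $\tau_v$). Two things are assumptions rather than details confirmed by this page: the importance score is taken to be the product $\alpha^{(l)} \cdot \pi^{(l)}_i$, and the function name `select_experts` and its argument layout are hypothetical.

```python
from typing import List, Sequence


def select_experts(
    router_probs: Sequence[float],  # local routing probabilities pi_i for one token
    alpha: float,                   # offline-calibrated global factor for this layer
    is_vision: bool,                # modality of the token
    tau_text: float,                # threshold for text tokens
    tau_vision: float,              # threshold for vision tokens
    k: int,                         # number of routed experts chosen by the router
) -> List[int]:
    """Return the indices of the top-k experts that are NOT skipped."""
    # Standard top-k routing: pick the k experts with the highest probability.
    top_k = sorted(
        range(len(router_probs)), key=lambda i: router_probs[i], reverse=True
    )[:k]
    # Dual-modality thresholding: each modality has its own threshold.
    tau = tau_vision if is_vision else tau_text
    # ASSUMPTION: the importance score is alpha * pi_i. The page only says the
    # two quantities are "combined". Experts scoring below tau are skipped.
    return [i for i in top_k if alpha * router_probs[i] >= tau]


probs = [0.05, 0.40, 0.10, 0.30, 0.15]
print(select_experts(probs, alpha=0.5, is_vision=False, tau_text=0.10, tau_vision=0.18, k=3))
print(select_experts(probs, alpha=0.5, is_vision=True, tau_text=0.10, tau_vision=0.18, k=3))
```

With these illustrative numbers, the same token keeps two experts under the text threshold and only one under the stricter vision threshold, which is the kind of modality-dependent behavior DMT is meant to allow.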

Paper Structure

This paper contains 23 sections, 6 theorems, 19 equations, 12 figures, 9 tables, 1 algorithm.

Key Result

Lemma 1

For fixed $q$, define $\Phi_q(p)$. If $g$ is non-decreasing in its second argument, then $\Phi_q(p)$ is monotone in $p$. Hence, if a feasible $p$ exists, the smallest feasible index is well-defined.
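A monotone feasibility predicate is what makes a binary search for the smallest feasible index valid. The sketch below is a generic illustration, not the paper's algorithm. It assumes that feasibility over the indices $0, \dots, n-1$ switches from infeasible to feasible at most once (the direction of monotonicity is an assumption), and `smallest_feasible` and `feasible` are hypothetical names.

```python
from typing import Callable, Optional


def smallest_feasible(feasible: Callable[[int], bool], n: int) -> Optional[int]:
    """Smallest p in [0, n) with feasible(p) True, or None if there is none.

    ASSUMPTION: feasible is monotone, i.e. if feasible(p) holds then
    feasible(p') holds for every p' > p.
    """
    lo, hi = 0, n  # invariant: every index < lo is infeasible; every index >= hi is feasible or out of range
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid        # mid is feasible, so the answer is at mid or to its left
        else:
            lo = mid + 1    # mid is infeasible, so the answer is strictly to the right
    return lo if lo < n else None


print(smallest_feasible(lambda p: p * p >= 50, 10))
print(smallest_feasible(lambda p: p * p >= 50, 5))
```

The search uses about $\log_2 n$ predicate evaluations instead of $n$, which matters when each evaluation means running a model on a calibration set.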

Figures (12)

  • Figure 1: Average performance (%) vs. expert skipping ratios (%) across different models \cite{wang2025internvl3, qwen3_vl_moe_doc, team2025kimi} and methods \cite{bai2025diep, huang2024mixture, lu2024not} on 13 benchmarks (as detailed in Sec. \ref{sec:imple}). The left subfigure is for Kimi-VL-A3B-Instruct \cite{team2025kimi} and the right subfigure is for Qwen3-VL-MoE-30B-A3B-Instruct \cite{qwen3_vl_moe_doc}.
  • Figure 2: Performance on image ((a)-(b)) and video ((c)) understanding tasks across various numbers of top-$k$ routed experts applied to different layer ranges for Kimi-VL-A3B-Instruct \cite{team2025kimi}. The model has 64 routed experts for each FFN within the 1st to the 26th layers, and sets $k=6$ by default.
  • Figure 3: (Left) t-SNE \cite{tsne} visualization of pre-FFN text/vision tokens across all layers. (Middle) Cosine similarity between pre-FFN and post-FFN text/vision tokens across layers. (Right) Angle between text/vision tokens and weights across different FFN layers. Here, the GQA \cite{hudsom2019gqa} dataset is used as the model input, and the model is the same as that in Fig. \ref{fig:motivation_global}.
  • Figure 4: Overview of MoDES. At inference, we use a text token (e.g., $\textcolor{c7}{\blacksquare}$ above) at the $l$-th FFN layer as an example. (a) We compute importance scores $s^{(l)}_i$ ($i\in\{2, 4, M\}$) by combining the offline-calibrated globally-modulated factor $\textcolor{c7}{\alpha^{(l)}}$ with the local routing probability $\pi^{(l)}_i$. These scores evaluate the top-$k$ ($k=3$) routed experts for token $\textcolor{c7}{\blacksquare}$. (b) We then apply a modality-specific threshold ($\textcolor{mygreen}{\tau_{t}}$ for text and $\textcolor{mygreen}{\tau_{v}}$ for vision) found by an efficient and effective frontier search. Experts with scores below the threshold are skipped. This method significantly reduces computation while preserving performance for MoE MLLMs. "E" and "calib set" denote the expert and $\mathcal{C}$ (Eq. (\ref{eq:alpha})).
  • Figure 5: (Left) $\alpha^{(l)}$ calibration time. (Right) Search time of frontier search (blue) vs. naive search (yellow). The bars/markers from left to right are for Kimi-VL-A3B-Instruct \cite{team2025kimi}, Qwen3-VL-MoE-30B-A3B-Instruct \cite{qwen3_vl_moe_doc}, InternVL-3.5-30B-A3B-HF \cite{wang2025internvl3}, and InternVL-3.5-GPT-OSS-20B-A4B-Preview-HF \cite{wang2025internvl3}.
  • ...and 7 more figures

Theorems & Definitions (12)

  • Lemma 1: Monotone feasibility in $p$
  • proof
  • Lemma 2: Monotone shift in $q$
  • proof
  • Lemma 3: Loop invariant
  • proof
  • Proposition 1: Correctness and time
  • proof
  • Lemma 4: Frontier suffices
  • proof
  • ...and 2 more
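The lemma titles above (monotone feasibility in $p$, monotone shift in $q$, loop invariant) suggest a staircase-style sweep over a two-dimensional grid of candidate indices. The sketch below shows that generic technique only. It is not taken from the paper, whose actual frontier search is not reproduced on this page. It assumes that feasibility is monotone in $p$ for each $q$ and that the smallest feasible $p$ never increases as $q$ grows. The names `frontier_sweep` and `feasible` are hypothetical.

```python
from typing import Callable, List, Tuple


def frontier_sweep(
    feasible: Callable[[int, int], bool], n_p: int, n_q: int
) -> List[Tuple[int, int]]:
    """For each q in [0, n_q), find the smallest feasible p in [0, n_p).

    ASSUMPTIONS (illustrative, not confirmed by the source):
      1. For fixed q, feasible(p, q) implies feasible(p', q) for all p' > p.
      2. The smallest feasible p is non-increasing as q increases.
    Under these assumptions the sweep needs at most n_p + n_q predicate
    calls, instead of n_p * n_q for an exhaustive grid search.
    """
    frontier: List[Tuple[int, int]] = []
    p = n_p  # n_p means "no feasible p found so far"
    for q in range(n_q):
        # Loop invariant: every index >= p is feasible for the current q.
        # Walk p downward while the next lower index is still feasible.
        while p > 0 and feasible(p - 1, q):
            p -= 1
        if p < n_p:
            frontier.append((p, q))
    return frontier


print(frontier_sweep(lambda p, q: p + q >= 4, 5, 4))
```

Each predicate call either lowers `p` (at most `n_p` times) or ends the inner loop for one `q` (at most `n_q` times), which is where the linear bound comes from. A reduction of this kind is consistent with the search-time gap reported in Figure 5, although the exact procedure may differ.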