FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang

TL;DR

This work proposes Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based multimodal large language models (MLLMs), developed from a routing analysis perspective. It combines two complementary strategies: expert activation reduction for visual tokens, which minimizes unnecessary expert computation, and routing-aware token pruning, which uses similarity in routing probability distributions to identify and remove highly redundant visual tokens.

Abstract

Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods mainly rely on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens to minimize unnecessary expert computation; and (ii) routing-aware token pruning that leverages similarity in routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5 demonstrate that FastMMoE can reduce FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, consistently outperforming dense-model pruning baselines including FastV and SparseVLM across multiple retention rates.
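
Strategy (i) can be illustrated with a minimal Python sketch of top-k routing in which vision tokens use a smaller expert budget from a given layer onward. This is a sketch under stated assumptions, not the authors' implementation. The function name `route_token` and the parameters `k_full`, `k_vision`, and `l_v` are hypothetical (the latter two mirror the symbols $K_v$ and $l_v$ used in the figure captions), and renormalizing the selected weights is an assumption that may differ between models.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, is_vision, layer, k_full, k_vision, l_v):
    """Return (expert_index, weight) pairs for a single token.

    Assumption: from layer l_v onward, vision tokens activate only
    k_vision experts, while text tokens always keep the full budget k_full.
    """
    probs = softmax(router_logits)
    k = k_vision if (is_vision and layer >= l_v) else k_full
    # Pick the k experts with the highest routing probability.
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    # Assumption: the selected weights are renormalized to sum to 1.
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]


logits = [2.0, 0.5, 1.0, -1.0]
text_route = route_token(logits, is_vision=False, layer=10, k_full=2, k_vision=1, l_v=4)
vision_route = route_token(logits, is_vision=True, layer=10, k_full=2, k_vision=1, l_v=4)
```

In this toy call the text token is dispatched to two experts and the vision token to one, so the expert computation for that vision token is roughly halved at this layer. Before layer `l_v`, both token types follow the same full routing.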

Paper Structure

This paper contains 68 sections, 43 equations, 22 figures, 10 tables.

Figures (22)

  • Figure 1: Similarity of adjacent tokens' routing probabilities (Eq. \ref{con:adjacent_sim}) across layers in InternVL3.5.
  • Figure 2: FastMMoE overview. (Left) Vision-token expert activation reduction: from layer $l_v$ onward, vision tokens (blue) activate fewer experts (red dashed arrows), while text tokens (green) keep full routing. (Right) Routing-aware token pruning: vision tokens are grouped into sliding windows ($W$), with routing-probability similarity $S_{i,v}$ and attention importance $\bar{A}_{i,v}$ combined into redundancy score $C_v$. High-redundancy windows are merged, and low-importance high-redundancy windows are pruned.
  • Figure 3: Average Performance Heatmap for InternVL3.5. We test different choices of $l_v, K_v$ for reducing the number of experts activated by vision tokens. More details are provided in Appendix \ref{sec:test_results}.
  • Figure 4: Average Performance Heatmap for DeepSeek-VL2. More details are provided in Appendix \ref{sec:test_results}.
  • Figure 5: Magnitude stability score $V_m$ for InternVL3.5 across layers. Higher $V_m$ means tighter magnitude concentration among expert outputs.
  • ...and 17 more figures
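
The routing-probability similarity mentioned in the captions of Figures 1 and 2 can be sketched as follows. This is a hedged illustration only: the exact similarity measure (the equation referenced in Figure 1) and the rule that combines similarity with attention importance into the redundancy score $C_v$ are not given in this excerpt. Cosine similarity, the function names, and the `window` and `stride` parameters are therefore assumptions.

```python
import math


def cosine(p, q):
    """Cosine similarity between two routing-probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def adjacent_similarity(routing_probs):
    """Similarity between each vision token and its right-hand neighbour."""
    return [
        cosine(routing_probs[i], routing_probs[i + 1])
        for i in range(len(routing_probs) - 1)
    ]


def window_similarity(routing_probs, window, stride):
    """Mean adjacent similarity inside each sliding window of tokens.

    Assumption: a high value marks a window whose tokens are routed
    almost identically and are therefore candidates for merging.
    """
    if window < 2:
        raise ValueError("window must cover at least two tokens")
    sims = adjacent_similarity(routing_probs)
    scores = []
    for start in range(0, len(routing_probs) - window + 1, stride):
        # A window of `window` tokens contains window - 1 adjacent pairs.
        pair_sims = sims[start:start + window - 1]
        scores.append(sum(pair_sims) / len(pair_sims))
    return scores


# Four toy vision tokens over three experts: two near-identical pairs.
probs = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.2, 0.7],
]
scores = window_similarity(probs, window=2, stride=1)
```

Here the first and last windows score close to 1 because their tokens share the same routing distribution, while the middle window scores about 1/3, since it straddles two differently routed tokens.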