MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion

Ruixiang Jiang, Lingbo Liu, Changwen Chen

TL;DR

The paper tackles the scalability and expressiveness limits of prompt-based multimodal fusion by introducing MoPE, a Mixture of Prompt Experts that generates instance-specific prompts via a multimodal router and a set of prompt experts. By decomposing the prompt into static, dynamic, and mapped components and routing per instance, MoPE achieves high expressiveness without increasing prompt length, delivering state-of-the-art results on six datasets across four modalities with only 0.8% of trainable parameters. Regularization techniques, including frozen routing embeddings and an importance loss, promote expert specialization and interpretable prompting. The approach demonstrates strong data scalability, architectural modularity, and compatibility with heterogeneous backbones, offering a practical and efficient path to robust multimodal fusion.

Abstract

Despite the demonstrated parameter efficiency of prompt-based fusion, its limited adaptivity and expressiveness hinder its effectiveness for multimodal applications at scale. In this paper, we present the first comprehensive study addressing these limitations. Our key motivation is to "divide and conquer" the vanilla prompt, traditionally shared across all instances, by generating instance-specific prompts. Specifically, we propose the Mixture of Prompt Experts (MoPE), a framework that significantly enhances prompt adaptivity and expressiveness by dynamically generating instance-specific prompts. MoPE leverages multimodal pairings as additional evidence, allowing the model to adaptively select optimal prompts tailored to each individual instance. Unlike traditional prompt-fusion methods, which encounter scalability bottlenecks when optimizing long unified prompts, MoPE maintains a fixed prompt length while effectively scaling the number of specialized experts. Moreover, we investigate regularization terms to encourage expert specialization, resulting in highly adaptive and interpretable prompting. MoPE fundamentally changes the scaling dynamic, unlocking greater expressiveness and adaptability to complex multimodal relationships, enabling the model to selectively attend to task-relevant sub-sequences based on instance-specific multimodal input. Extensive experiments across six multimodal datasets spanning four modalities demonstrate state-of-the-art performance for multimodal fusion, matching or surpassing the performance of fine-tuning while requiring only 0.8% of the trainable parameters. Code is available at https://github.com/songrise/MoPE.

Paper Structure (34 sections, 2 theorems, 28 equations, 13 figures, 9 tables)

Key Result

Theorem 1

Under the above setup and assumption, for any global prompt $\mathbf{P}_{\text{shared}}\in \mathbb{P}$ (i.e., one that is used for all instances), the accumulated error with the global prompt must be higher than the sum of the instance-specific minimal errors.
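The claim of Theorem 1 can be illustrated with a toy example. The sketch below is not the paper's setup: it assumes scalar prompts, a squared-error function, three instances whose optimal prompts differ, and a small candidate grid standing in for $\mathbb{P}$. All names (`targets`, `error`, `candidates`) are hypothetical and chosen only for illustration.

```python
# Toy illustration only: scalar "prompts" and squared error are assumptions,
# not the error model used in the paper.
targets = [-1.0, 0.5, 2.0]  # hypothetical per-instance optimal prompts


def error(prompt, target):
    """Assumed per-instance error: squared distance to that instance's optimum."""
    return (prompt - target) ** 2


# Candidate prompt set standing in for P: -2.0, -1.75, ..., 3.0
candidates = [-2.0 + 0.25 * i for i in range(21)]

# Sum of instance-specific minimal errors: each instance picks its own prompt.
instance_optimal_total = sum(
    min(error(p, t) for p in candidates) for t in targets
)

# Best accumulated error achievable with one prompt shared by all instances.
best_global_total = min(
    sum(error(p, t) for t in targets) for p in candidates
)

print("instance-specific total:", instance_optimal_total)
print("best shared-prompt total:", best_global_total)
```

In this toy case every instance can reach zero error with its own prompt, while even the best shared prompt leaves a positive accumulated error, because no single value matches all three optima at once.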

Figures (13)

  • Figure 1: High-level motivation of MoPE-based multimodal fusion. (a) Vanilla prompt fusion uses a non-adaptive global prompt (gray rectangles) for all inputs. It is challenging to fully capture the per-instance shift, resulting in suboptimal performance. (b) MoPE achieves instance-wise adaptivity by routing the most effective prompts for each instance. Its specialized expert effectively learns to divide the problem space into concept clusters. The prompt is then generated according to the predicted cluster, leading to a more expressive representation and superior performance.
  • Figure 2: Architecture overview. (a) A sequential fusion pipeline is employed, where the representation from complementary modality $\mathbb{Y}$ guides the prompting of modality $\mathbb{X}$. Three types of prompts are used at each layer, which are concatenated to the token embeddings. (b) MoPE is introduced to generate the dynamic prompt, which routes the most effective dynamic prompt based on multimodal representations. Here, $(x_1,y_1), (x_2,y_2)$ indicate two input pairs. (c) Inside the multimodal router, we project the representation from each modality to get cross-modal and inter-modal embeddings. The concatenated embeddings $\mathbf{q}$ are used to query the routing embedding $\mathbf{k}_{1\dots k}$ paired with each expert for routing score calculation. Better viewed in color.
  • Figure 3: Illustration of expert specialization in MoPE. Each expert learns to specialize in a group of concepts, akin to a soft cluster in the semantic space. For instance, Expert 3 activates for prompts related to children and playful activities, while Expert 12 activates for scenes involving crowds and public spaces.
  • Figure 4: Visualization of attention maps produced by different prompting methods. Compared with the vanilla-prompt method, which produces a similar attention map for each instance, the attention map produced by MoPE is adaptive. It attends more closely to the tokens most related to the text query, fulfilling the requirement of top-down attention. Better viewed in color; warm colors indicate a higher attention score.
  • Figure 5: More experts vs. longer prompts. We compare increasing the number of experts, $k$, versus lengthening the prompt, $l$. Expert-scaling consistently outperforms length-scaling, exhibiting a linear growth trend. Conversely, length-scaling suffers from deterioration with long prompts.
  • ...and 8 more figures
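
The routing step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes plain linear projections, dot-product routing scores, and a softmax-weighted mixture of experts, none of which the captions specify. The function `route_dynamic_prompt` and all dimension names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vec, weight):
    """Apply a projection given as a list of rows (out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def route_dynamic_prompt(x_repr, y_repr, w_x, w_y, routing_keys, experts):
    """Mix k prompt experts into one dynamic prompt for a single (x, y) pair."""
    # Project each modality's representation and concatenate into the query q.
    q = linear(x_repr, w_x) + linear(y_repr, w_y)
    # Score q against the routing embedding k_i paired with each expert
    # (dot product is an assumption of this sketch).
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in routing_keys]
    weights = softmax(scores)
    # Weighted sum of experts; the result keeps the shape of a single expert.
    length, dim = len(experts[0]), len(experts[0][0])
    prompt = [
        [sum(w * e[t][d] for w, e in zip(weights, experts)) for d in range(dim)]
        for t in range(length)
    ]
    return prompt, weights


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_x, d_y, d_proj, d_tok = 6, 5, 4, 8  # hypothetical feature sizes
k, l = 3, 2                           # number of experts, prompt length

w_x, w_y = rand_matrix(d_proj, d_x), rand_matrix(d_proj, d_y)
routing_keys = rand_matrix(k, 2 * d_proj)        # one routing embedding per expert
experts = [rand_matrix(l, d_tok) for _ in range(k)]
x_repr = [random.gauss(0.0, 1.0) for _ in range(d_x)]
y_repr = [random.gauss(0.0, 1.0) for _ in range(d_y)]

prompt, weights = route_dynamic_prompt(x_repr, y_repr, w_x, w_y, routing_keys, experts)
print(len(prompt), len(prompt[0]), len(weights))
```

In this sketch the dynamic prompt has `l` tokens regardless of how many experts are mixed, which mirrors the fixed-prompt-length property stated in the abstract: adding experts enlarges `experts` and `routing_keys` but not the prompt passed to the backbone.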

Theorems & Definitions (4)

  • Theorem 1: No Global Prompt Achieves Instance-Optimal Error Simultaneously
  • proof
  • Theorem 2: Improved Adaptivity of MoPE
  • proof