MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts

Haofei Yu; Zhengyang Qi; Lawrence Jang; Ruslan Salakhutdinov; Louis-Philippe Morency; Paul Pu Liang

TL;DR

MMoE tackles the limitation of monolithic multimodal models by introducing a mixtures-of-experts approach that handles distinct interaction types between vision and language. Data points are categorized into redundancy, uniqueness, and synergy, with three specialized experts trained on corresponding subsets and combined via a learned fusion mechanism at inference. The method achieves state-of-the-art results on sarcasm and humor detection datasets (MUStARD and URFUNNY) across multiple backbone models, and shows robust improvements especially for weaker models and harder tasks. The findings highlight the practical value of interaction-aware routing and fusion in multimodal prediction, while outlining future work on finer-grained interaction types and broader modalities.

Abstract

Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today's multimodal models mainly focus on the correspondence between images and text, using this for tasks like image-text matching. However, this covers only a subset of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or humor expressed through utterances and tone of voice, remain challenging. In this paper, we introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE). The key idea in MMoE is to train separate expert models for each type of multimodal interaction, such as redundancy present in both modalities, uniqueness in one modality, or synergy that emerges when both modalities are fused. On a sarcasm detection task (MUStARD) and a humor detection task (URFUNNY), we obtain new state-of-the-art results. MMoE can also be applied to various types of models to improve their performance.

Paper Structure (56 sections, 1 theorem, 4 equations, 8 figures, 15 tables, 1 algorithm)

Introduction
Related Work
Multimodal Interactions
Multimodal Language Models
Ensembles and Mixtures of Experts
Multimodal Mixtures of Experts
Categorizing Multimodal Interactions
Training Expert Models for Each Multimodal Interaction Type
Inference with Mixtures of Expert Models
Experiments
Experimental Setup
Model
Multimodal prediction task
Main Results
Overall comparison with state-of-the-art
...and 41 more sections

Key Result

Theorem 1

Let $y_1$ and $y_2$ denote the predictions from unimodal classifiers, and let $y_m$ represent the multimodal prediction from multimodal models. The interaction discrepancy between the predictions is defined as: where $\delta(\cdot, \cdot)$ denotes the discrepancy function between two predictions. The categorization is then described as follows:
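
The categorization rule itself is not reproduced in this summary. Going by the description in the Figure 2 caption, it might be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `categorize_interaction` is hypothetical, the discrepancy is assumed to be a simple inequality check, and the comparison uses the ground-truth label as in Figure 2. The case where both unimodal predictions agree but are wrong is assigned to synergy here, which is an assumption made to resolve the overlap between the caption's definitions.

```python
from typing import Hashable


def categorize_interaction(y1: Hashable, y2: Hashable, y_true: Hashable) -> str:
    """Assign a datapoint to 'R', 'U', or 'S' (hypothetical helper).

    y1:     prediction of the vision-only classifier
    y2:     prediction of the text-only classifier
    y_true: ground-truth multimodal label
    """
    # Redundancy: both unimodal predictions agree (assumed: and match the label).
    if y1 == y2 == y_true:
        return "R"
    # Uniqueness: the unimodal predictions differ and exactly one is correct.
    if y1 != y2 and (y1 == y_true or y2 == y_true):
        return "U"
    # Synergy: the label agrees with neither unimodal prediction.
    return "S"


if __name__ == "__main__":
    # Toy (y1, y2, y_true) triples over classes {A, B, C}.
    for triple in [("A", "A", "A"), ("A", "B", "A"), ("A", "B", "C")]:
        print(triple, categorize_interaction(*triple))
```

Under this reading, each training datapoint lands in exactly one subset, so the three expert training sets partition the dataset.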

Figures (8)

Figure 1: A single model cannot handle all types of multimodal interactions well for hard multimodal prediction tasks. For example, to predict sarcasm, ALBEF can have $\sim$89% F1 when modalities contain redundant information (e.g., both the text and the image are sarcastic), but drops to $\sim$24% F1 when there is synergy between modalities (e.g., the image shows a cold winter scene and the text says it is a happy spring, indicating the user's sarcastic intent about the weather).
Figure 2: We classify one multimodal dataset into three subsets based on their multimodal interactions: (1) Redundancy (R), when both modalities provide the same prediction, (2) Uniqueness (U), when the two modalities make different predictions, one of which is correct, (3) Synergy (S), when the ground-truth multimodal label does not agree with either unimodal prediction. $y_1$ represents the prediction based on the vision modality, $y_2$ represents the prediction from the text modality, and $y_m^{*}$ represents the ground-truth label. {$A$, $B$, $C$} represents classes.
Figure 3: MMoE training: Each multimodal datapoint is categorized based on its multimodal interaction and used to train an expert model tailored only for that interaction.
Figure 4: MMoE inference: We infer which multimodal interaction a test datapoint requires and use a soft weighted fusion over the outputs from multiple expert models.
Figure 5: MMoE applicability: MMoE can be used as a drop-in method to the training of fusion-based VLMs, multimodal extended LLMs, and image-captioned LLMs.
...and 3 more figures
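
The soft weighted fusion mentioned in the Figure 4 caption can be illustrated with a small sketch. The captions do not say how the fusion weights are produced, so the code assumes they come from some interaction-type predictor and are passed in as plain numbers. The function name `fuse_expert_outputs` and the dictionary layout are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def fuse_expert_outputs(
    expert_probs: Dict[str, List[float]],
    interaction_weights: Dict[str, float],
) -> List[float]:
    """Combine per-expert class probabilities with soft weights.

    expert_probs:        class probabilities from each expert, keyed by
                         interaction type ('R', 'U', 'S').
    interaction_weights: non-negative scores for how likely the test
                         datapoint is to need each interaction type
                         (assumed to come from a separate predictor).
    """
    total = sum(interaction_weights[name] for name in expert_probs)
    if total <= 0:
        raise ValueError("interaction weights must have a positive sum")

    n_classes = len(next(iter(expert_probs.values())))
    fused = [0.0] * n_classes
    for name, probs in expert_probs.items():
        weight = interaction_weights[name] / total  # normalize to sum to 1
        for k, p in enumerate(probs):
            fused[k] += weight * p
    return fused


if __name__ == "__main__":
    # Toy binary example: [P(not sarcastic), P(sarcastic)] per expert.
    probs = {"R": [0.9, 0.1], "U": [0.6, 0.4], "S": [0.2, 0.8]}
    weights = {"R": 0.2, "U": 0.3, "S": 0.5}
    print(fuse_expert_outputs(probs, weights))
```

Because the weights are normalized, the fused vector stays a valid probability distribution whenever each expert's output is one. A hard routing variant would instead pick the single expert with the largest weight.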

Theorems & Definitions (2)

Definition 1: Prediction Discrepancy Function
Theorem 1: Multimodal Interaction-based Categorization
