MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts
Haofei Yu, Zhengyang Qi, Lawrence Jang, Ruslan Salakhutdinov, Louis-Philippe Morency, Paul Pu Liang
TL;DR
MMoE addresses a limitation of monolithic multimodal models by introducing a mixture-of-experts approach in which each expert handles a distinct type of interaction between vision and language. Data points are categorized as redundancy, uniqueness, or synergy; three specialized experts are trained on the corresponding subsets, and their outputs are combined through a learned fusion mechanism at inference. The method achieves state-of-the-art results on the sarcasm and humor detection datasets MUStARD and URFUNNY across multiple backbone models, with robust improvements especially for weaker models and harder tasks. The findings highlight the practical value of interaction-aware routing and fusion in multimodal prediction, and the paper outlines future work on finer-grained interaction types and broader modalities.
Abstract
Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today's multimodal models mainly focus on the correspondence between images and text, using this for tasks like image-text matching. However, this covers only a subset of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or humor expressed through utterances and tone of voice, remain challenging. In this paper, we introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE). The key idea in MMoE is to train separate expert models for each type of multimodal interaction, such as redundancy present in both modalities, uniqueness in one modality, or synergy that emerges when both modalities are fused. On a sarcasm detection task (MUStARD) and a humor detection task (URFUNNY), we obtain new state-of-the-art results. MMoE can also be applied to various types of models to improve their performance.
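
The pipeline described above (split the data by interaction type, train one expert per type, fuse the experts at inference) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The categorization rule based on unimodal predictions, the function names `categorize` and `fuse`, and the use of a normalized weighted average as the fusion step are all assumptions made for illustration; in practice the fusion weights would come from a learned component rather than being set by hand.

```python
from typing import Dict

INTERACTIONS = ("redundancy", "uniqueness", "synergy")


def categorize(vision_pred: int, text_pred: int, label: int) -> str:
    """Assign one example to an interaction type (hypothetical rule).

    Assumption: each modality has its own unimodal classifier, and we
    compare the two unimodal predictions with the ground-truth label.
    """
    if vision_pred == text_pred == label:
        return "redundancy"  # either modality alone gives the right answer
    if vision_pred != text_pred:
        return "uniqueness"  # the two modalities disagree
    return "synergy"         # modalities agree with each other but miss the label


def fuse(expert_probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-expert positive-class probabilities.

    Assumption: fusion is a normalized weighted average; `weights` stands in
    for whatever a learned fusion model would output for this example.
    """
    total = sum(weights[name] for name in INTERACTIONS)
    return sum(weights[name] * expert_probs[name] for name in INTERACTIONS) / total


# Illustrative values only, not results from the paper.
probs = {"redundancy": 0.2, "uniqueness": 0.4, "synergy": 0.9}
weights = {"redundancy": 0.25, "uniqueness": 0.25, "synergy": 0.5}
fused = fuse(probs, weights)
```

Here `categorize` would be applied to the training set to form the three subsets on which the experts are trained, while `fuse` would be applied per example at inference once each expert has produced its prediction.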
