FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion

Xing Han; Huy Nguyen; Carl Harris; Nhat Ho; Suchi Saria

TL;DR

FuseMoE is a scalable, flexible multimodal fusion framework for FlexiModal data. It combines a sparsely gated MoE backbone with a novel Laplace gating function, per-modality routers, and an irregularity encoder. Theoretical analysis shows that Laplace gating achieves better convergence rates than Softmax gating for density and parameter estimation, and empirical results show gains across medical, vision, and sentiment benchmarks, especially under missing modalities and irregular sampling. Entropy-regularized routing, modular modality handling, and irregularity encoding together improve predictive performance while remaining scalable to many modalities. The work offers a practical, theoretically supported approach to multimodal fusion in real-world settings such as EHRs and multimedia analysis.
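
To make the contrast between Softmax and Laplace gating concrete, here is a minimal NumPy sketch of a sparse Top-K gate under two scoring rules. It rests on assumptions not stated in this summary: the Softmax variant is taken to score each expert by an inner product with a per-expert weight vector, and the Laplace variant by the negative Euclidean distance to that vector. The names `top_k_gate`, `softmax_scores`, and `laplace_scores` are hypothetical and are not the authors' code.

```python
import numpy as np


def top_k_gate(scores: np.ndarray, k: int) -> np.ndarray:
    """Softmax over the k largest scores in each row; all other experts get weight 0."""
    # Column indices of the k largest scores per row (order within the k is arbitrary).
    top_idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    masked = np.full_like(scores, -np.inf)
    top_vals = np.take_along_axis(scores, top_idx, axis=-1)
    np.put_along_axis(masked, top_idx, top_vals, axis=-1)
    # Numerically stable softmax; exp(-inf) evaluates to 0 for the dropped experts.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def softmax_scores(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Assumed Softmax-gate affinity: inner product of each input with each expert vector."""
    return x @ W.T  # (batch, d) @ (d, n_experts) -> (batch, n_experts)


def laplace_scores(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Assumed Laplace-gate affinity: negative Euclidean distance to each expert vector."""
    diff = x[:, None, :] - W[None, :, :]  # (batch, n_experts, d)
    return -np.linalg.norm(diff, axis=-1)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # 4 token embeddings of width 8
W = rng.normal(size=(6, 8))   # 6 experts, one gating vector each

gate_softmax = top_k_gate(softmax_scores(x, W), k=2)
gate_laplace = top_k_gate(laplace_scores(x, W), k=2)
```

Under the distance-based rule the scores are bounded above by zero, so the gate weights depend on how close an input lies to each expert vector rather than on the scale of the inner product.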

Abstract

As machine learning models in critical fields increasingly grapple with multimodal data, they face the dual challenges of handling a wide array of modalities, often incomplete due to missing elements, and the temporal irregularity and sparsity of collected samples. Successfully leveraging this complex data, while overcoming the scarcity of high-quality training samples, is key to improving these models' predictive performance. We introduce "FuseMoE", a mixture-of-experts framework that incorporates an innovative gating function. Designed to integrate a diverse number of modalities, FuseMoE is effective in managing scenarios with missing modalities and irregularly sampled data trajectories. Theoretically, our unique gating function contributes to enhanced convergence rates, leading to better performance in multiple downstream tasks. The practical utility of FuseMoE in the real world is validated by a diverse set of challenging prediction tasks.

Paper Structure (86 sections, 7 theorems, 92 equations, 13 figures, 10 tables)

Introduction
Contributions
FuseMoE: Enhance Predictive Performance for FlexiModal Data
Sparse MoE Backbone
Modality and Irregularity Encoder
MoE Fusion Layer
Router Design Study
Missing Modalities
Theoretical Contribution
Experiments
Overview
Main Results
CMU-MOSI and MOSEI Datasets
CIFAR-10 Dataset
MIMIC-IV and PAM Datasets
...and 71 more sections

Key Result

Theorem 3.1

The density estimation $p_{\widehat{G}_n}(Y|X)$ converges to the true density $p_{G_*}(Y|X)$ under the Total Variation distance at the following rate:

Figures (13)

Figure 1: An example of addressing the challenge of FlexiModal Data: patients in ICUs often have extensive and irregular health status measurements over time, while patients with milder conditions only require monitoring across fewer categories. FuseMoE handles inputs featuring any combination of modalities, including those with missing elements. It first encodes inputs using modality-specific feature extractors, then applies a multi-time attention mechanism (Shukla and Marlin, 2021) to address temporal irregularities. At the core of FuseMoE lies the MoE Fusion Layer, where a routing mechanism is trained to categorize multimodal inputs and direct them to the appropriate combinations of MLPs. The outputs of these MLPs are weighted by a gating function, yielding fused embeddings that are passed on for further processing.
Figure 2: We present three exemplary designs of the Top-$K$ router for effective multimodal fusion, considering an input scenario with three modalities: Time-Series (TS), Text (TXT), and images (IMG). (a) The joint router design utilizes a concatenated embedding of all modalities, directing this combined input to selected experts. (b) In the modality-specific router design, each modality's embedding is independently assigned to a shared pool of experts. (c) The third design variant also uses modality-specific routers but assigns each modality's embedding to separate pools of experts, each pool uniquely tailored to process a specific modality type.
Figure 3: Log-log scaled plots illustrating simulation results under the exact-specified (left) and the over-specified settings (right). The orange curves depict the mean discrepancy between the MLE $\widehat{G}_n$ and the true mixing measure $G_*$, accompanied by error bars signifying two empirical standard deviations. Additionally, the gray dash-dotted line represents the least-squares fitted linear regression line for these data points. Finally, the loss functions $\mathcal{D}_1$ and $\mathcal{D}_2$ are defined in equations \ref{eq:Voronoi_loss} and \ref{eq:Voronoi_loss_over}, respectively. See Appendix \ref{appendix:numerical_experiments} for the experimental details.
Figure 4: (a) The Laplace gating mechanism enhances CIFAR-10 classification when integrated into Vision-MoE (Riquelme et al., 2021). We employed Vision Transformer (ViT) (Dosovitskiy et al., 2020) and ViT-small as the backbone models and selectively replaced their FFN layers with MoE layers; (b) FuseMoE improves prediction on the PAM dataset over baseline time series models; (c) Per-modality routers and the entropy loss $\mathcal{E}$ mitigate the impact of missing modalities.
Figure 5: Schematic of tasks of interest. Plotted are example vitals/labs, radiological notes, X-rays, and ECGs sampled over the course of a patient's ICU stay. The first three rows represent example observations from a single modality consisting of irregularly sampled vital signs (HR, BP) and lab values (Glucose). The following three rows represent irregularly sampled radiological notes, X-rays, and ECGs. Opaque shapes denote observations falling within the observation window (i.e., observations that are used to generate predictions), while translucent shapes are not used to generate predictions. For the 48-IHM task, we use the first 48 hours of observations to predict death at any time during the ICU stay. For the LOS task, we use the first 48 hours of observations to predict whether the patient will be discharged (alive) during the following 48 hours. In the phenotyping task (PHE), we use all observations to predict one of 25 critical care conditions.
...and 8 more figures
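
The three Top-$K$ router designs described in the Figure 2 caption can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: routers are plain linear maps followed by a Top-K softmax, the embedding width, expert count, and pool assignment are arbitrary, and every name (`route`, `joint_router`, `shared_pool_routers`, `disjoint_pool_routers`, `POOLS`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, K = 16, 6, 2           # assumed sizes, chosen only for illustration
MODALITIES = ("TS", "TXT", "IMG")


def route(x, W, k, allowed=None):
    """Dense weight vector over experts: softmax over the top-k logits of W @ x.

    If `allowed` is given, only those expert indices are eligible.
    """
    logits = W @ x
    if allowed is not None:
        mask = np.full(logits.shape, -np.inf)
        mask[allowed] = 0.0
        logits = logits + mask
    top = np.argsort(logits)[-k:]            # indices of the k largest logits
    weights = np.zeros_like(logits)
    e = np.exp(logits[top] - logits[top].max())
    weights[top] = e / e.sum()
    return weights


def joint_router(emb, W_joint, k):
    """(a) One router over the concatenated embedding of all modalities."""
    x = np.concatenate([emb[m] for m in MODALITIES])
    return route(x, W_joint, k)


def shared_pool_routers(emb, W_by_mod, k):
    """(b) One router per modality, all routing into the same expert pool."""
    return {m: route(emb[m], W_by_mod[m], k) for m in emb}


def disjoint_pool_routers(emb, W_by_mod, pools, k):
    """(c) One router per modality, each restricted to its own expert pool."""
    return {m: route(emb[m], W_by_mod[m], k, allowed=pools[m]) for m in emb}


emb = {m: rng.normal(size=D) for m in MODALITIES}
W_joint = rng.normal(size=(N_EXPERTS, D * len(MODALITIES)))
W_by_mod = {m: rng.normal(size=(N_EXPERTS, D)) for m in MODALITIES}
POOLS = {"TS": [0, 1], "TXT": [2, 3], "IMG": [4, 5]}

w_joint = joint_router(emb, W_joint, K)
w_shared = shared_pool_routers(emb, W_by_mod, K)
w_disjoint = disjoint_pool_routers(emb, W_by_mod, POOLS, K)
```

In this sketch the per-modality variants iterate only over the modalities present in `emb`, so an absent modality simply contributes no routing decision, whereas the joint variant needs every embedding to build its concatenated input.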

Theorems & Definitions (8)

Theorem 3.1: Density estimation
Theorem 3.2: Parameter Estimation
Theorem J.1
Theorem J.2: Exact-specified setting
Lemma K.1
Lemma K.2: Theorem 7.4, van de Geer (2000)
Lemma L.1
Proof: Proof of Lemma \ref{prop:identifiable}
