FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion
Xing Han, Huy Nguyen, Carl Harris, Nhat Ho, Suchi Saria
TL;DR
FuseMoE is a scalable, flexible multimodal fusion framework for FlexiModal data. It combines a sparsely gated MoE backbone with a novel Laplace gating function, per-modality routers, and an irregularity encoder. Theoretical analysis shows that Laplace gating has better convergence properties than Softmax gating for density and parameter estimation, and empirical results show gains across medical, vision, and sentiment benchmarks, especially under missing modalities and irregular sampling. Together, entropy-regularized routing, modular modality handling, and robust irregularity encoding improve predictive performance while remaining scalable to many modalities. The work offers a practical, theoretically supported path to robust multimodal fusion in real-world settings such as electronic health records (EHRs) and multimedia analysis.
Abstract
As machine learning models in critical fields increasingly grapple with multimodal data, they face two challenges: handling a wide array of modalities, which are often incomplete because of missing elements, and coping with the temporal irregularity and sparsity of collected samples. Successfully leveraging this complex data, while overcoming the scarcity of high-quality training samples, is key to improving these models' predictive performance. We introduce "FuseMoE", a mixture-of-experts framework that incorporates an innovative gating function. Designed to integrate a varying number of modalities, FuseMoE effectively handles scenarios with missing modalities and irregularly sampled data trajectories. Theoretically, our gating function yields improved convergence rates, leading to better performance in multiple downstream tasks. The practical utility of FuseMoE is validated on a diverse set of challenging real-world prediction tasks.
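The text above does not spell out the gating function, so the following is only a rough sketch under stated assumptions. It assumes that Laplace gating scores each expert by the negative Euclidean distance between the input and a per-expert key vector, while Softmax gating scores by inner product, and that both keep the top-k experts before normalizing. The function names, the `expert_keys` layout, and the toy values are hypothetical and are not taken from the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(scores, k):
    """Sparse gate: softmax over the k largest scores, zero for the rest."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    gate = [0.0] * len(scores)
    for i, w in zip(keep, weights):
        gate[i] = w
    return gate


def inner_product_scores(x, expert_keys):
    """Softmax-style gating score: inner product of input and expert key."""
    return [sum(a * b for a, b in zip(x, key)) for key in expert_keys]


def laplace_scores(x, expert_keys):
    """Assumed Laplace-style score: negative L2 distance to each expert key."""
    return [
        -math.sqrt(sum((a - b) ** 2 for a, b in zip(x, key)))
        for key in expert_keys
    ]


# Toy input and three hypothetical expert keys; the third has a large norm.
x = [1.0, 0.0]
expert_keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0]]

# Inner-product scores are 1, 0, 5, so experts 2 and 0 are selected.
gate_softmax = top_k_gate(inner_product_scores(x, expert_keys), k=2)
# Distance scores are 0, -sqrt(2), -4, so experts 0 and 1 are selected.
gate_laplace = top_k_gate(laplace_scores(x, expert_keys), k=2)
```

In this toy case the inner-product score favors the large-norm key, whereas the distance-based score is bounded above by zero and favors the keys closest to the input. Whether this matches the exact form used in FuseMoE should be checked against the paper itself.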
