Guiding Mixture-of-Experts with Temporal Multimodal Interactions
Xing Han, Hsing-Huan Chung, Joydeep Ghosh, Paul Pu Liang, Suchi Saria
TL;DR
Time-MoE addresses a gap in MoE routing: standard routers ignore how modalities interact over time. It formalizes interaction flow as multi-source directed information $DI(\tau)$ and decomposes it into redundancy $R(\tau)$, modality-specific uniqueness $U_{1}(\tau)$ and $U_{2}(\tau)$, and synergy $S(\tau)$, all estimated efficiently with a multi-scale $BATCH$ approach. An RUS-Aware Router then routes tokens using these redundancy, uniqueness, and synergy cues, aided by auxiliary losses and a GRU for temporal context. Across six diverse multimodal benchmarks, Time-MoE achieves state-of-the-art performance and yields more interpretable routing patterns, demonstrating the practical value of temporal interactions for expert specialization.
Abstract
Mixture-of-Experts (MoE) architectures have become pivotal for large-scale multimodal models. However, their routing mechanisms typically overlook the informative, time-varying interaction dynamics between modalities. This limitation hinders expert specialization, as the model cannot explicitly leverage intrinsic modality relationships for effective reasoning. To address this, we propose a novel framework that guides MoE routing using quantified temporal interactions between modalities. A multimodal interaction-aware router learns to dispatch tokens to experts based on the nature of their interactions. This dynamic routing encourages experts to acquire generalizable interaction-processing skills rather than merely learning task-specific features. The framework builds on a new formulation of temporal multimodal interaction dynamics. We first demonstrate that these temporal multimodal interactions reveal meaningful patterns across applications, and then show how they can be leveraged to improve both the design and performance of MoE-based models. Comprehensive experiments on challenging multimodal benchmarks validate our approach, demonstrating both enhanced performance and improved interpretability.
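To make the idea of interaction-aware dispatch concrete, the following is a minimal sketch of one way a router could use interaction cues. It is not the paper's implementation. It assumes, hypothetically, that each token arrives with nonnegative estimates of redundancy, uniqueness, and synergy (keys `"R"`, `"U1"`, `"U2"`, `"S"`), that each expert has a hand-set affinity for those interaction types, and that the cues enter the router as an additive bias on ordinary content-based logits before top-k selection. The function name `rus_biased_route`, the `expert_affinity` table, and the additive-bias form are all illustrative assumptions; the learned router, auxiliary losses, and GRU temporal context described in the TL;DR are not modeled here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rus_biased_route(token_logits, rus, expert_affinity, top_k=2):
    """Select top_k experts for one token (illustrative sketch only).

    token_logits    -- content-based router logits, one per expert
    rus             -- dict with keys "R", "U1", "U2", "S": nonnegative
                       interaction estimates for this token (assumed given)
    expert_affinity -- one dict per expert mapping interaction type to a
                       bias strength (hypothetical, hand-set here)
    Returns a list of (expert_index, gate_weight) pairs whose weights sum to 1.
    """
    # Normalize the interaction estimates to fractions of their total.
    total = sum(rus.values()) or 1.0
    frac = {k: v / total for k, v in rus.items()}

    # Assumption: interaction cues act as an additive bias on the logits.
    biased = [
        logit + sum(aff.get(k, 0.0) * frac[k] for k in frac)
        for logit, aff in zip(token_logits, expert_affinity)
    ]

    probs = softmax(biased)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize the gate weights over the selected experts.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Four experts, each nominally tied to one interaction type.
expert_affinity = [{"R": 2.0}, {"U1": 2.0}, {"U2": 2.0}, {"S": 2.0}]

# A token whose interaction estimate is dominated by synergy.
routed = rus_biased_route(
    token_logits=[0.0, 0.0, 0.0, 0.0],
    rus={"R": 0.1, "U1": 0.1, "U2": 0.1, "S": 0.7},
    expert_affinity=expert_affinity,
    top_k=2,
)
```

With uninformative content logits, the synergy-dominated token is sent first to the expert with synergy affinity, which is the kind of interaction-driven specialization the abstract describes. In a trained model the affinities would be learned rather than fixed by hand.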
