Guiding Mixture-of-Experts with Temporal Multimodal Interactions

Xing Han, Hsing-Huan Chung, Joydeep Ghosh, Paul Pu Liang, Suchi Saria

TL;DR

Time-MoE addresses a gap in MoE routing: existing routers do not account for temporal multimodal interaction dynamics. It formalizes interaction flow via multi-source directed information $DI(\tau)$ and its decomposition into redundancy $R(\tau)$, uniqueness $U_{1}(\tau)$ and $U_{2}(\tau)$, and synergy $S(\tau)$, estimated efficiently with a multi-scale BATCH estimator. The framework introduces an RUS-Aware Router that routes tokens based on redundancy, uniqueness, and synergy (RUS) cues, aided by auxiliary losses and a GRU for temporal context. Across six diverse multimodal benchmarks, Time-MoE achieves state-of-the-art performance and yields more interpretable routing patterns, demonstrating the practical value of leveraging temporal interactions for expert specialization.
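
As a reading aid, the decomposition mentioned above can be written out under the standard partial information decomposition convention, in which the four components sum to the total information. This is a sketch under that assumption; the summary does not state the paper's exact formulation, and Definition 1 in the paper is the authoritative source:

$$DI(\tau) = R(\tau) + U_{1}(\tau) + U_{2}(\tau) + S(\tau)$$

Here $R(\tau)$ would be the information about the target that both modalities share at lag $\tau$, $U_{1}(\tau)$ and $U_{2}(\tau)$ the information unique to each modality, and $S(\tau)$ the information available only from the two modalities jointly.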

Abstract

Mixture-of-Experts (MoE) architectures have become pivotal for large-scale multimodal models. However, their routing mechanisms typically overlook the informative, time-varying interaction dynamics between modalities. This limitation hinders expert specialization, as the model cannot explicitly leverage intrinsic modality relationships for effective reasoning. To address this, we propose a novel framework that guides MoE routing using quantified temporal interaction. A multimodal interaction-aware router learns to dispatch tokens to experts based on the nature of their interactions. This dynamic routing encourages experts to acquire generalizable interaction-processing skills rather than merely learning task-specific features. Our framework builds on a new formulation of temporal multimodal interaction dynamics, which are used to guide expert routing. We first demonstrate that these temporal multimodal interactions reveal meaningful patterns across applications, and then show how they can be leveraged to improve both the design and performance of MoE-based models. Comprehensive experiments on challenging multimodal benchmarks validate our approach, demonstrating both enhanced performance and improved interpretability.

Paper Structure

This paper contains 23 sections, 10 equations, 8 figures, 10 tables, and 3 algorithms.

Figures (8)

  • Figure 1: Overview of Time-MoE. The left panel illustrates the overall architecture, where multimodal inputs are processed through $N$ stacked encoder layers composed of alternating Transformer and MoE blocks. The core innovation of Time-MoE lies in the MoE layers, detailed on the right. Their central component is the RUS-aware router, which leverages temporal multimodal interactions (redundancy, uniqueness, and synergy) to guide the routing of token embeddings across different time lags. Based on these interaction dynamics, the router determines which modality pairs should (or should not) be routed to the same expert, enabling more principled and interpretable expert specialization. As an example, $m_2$ (yellow) and $m_3$ (green) exhibit high redundancy according to their temporal RUS values; the RUS-aware router is therefore more likely to assign them to the same expert (yellow and green arrows).
  • Figure 2: Decomposed directed information components across time lag $\tau$.
  • Figure 3: Schematic overview of the multi-scale BATCH estimator. The procedure consists of four stages: (1) encoding empirical datasets $\mathcal{D}_\tau = \{(X_{1}^{n-\tau}, X_{2}^{n-\tau}, Y^n)\}_{\tau=1}^3$ with shared encoders $g$, with each lag $\tau$ further embedded by an encoder $e$ to produce $e(\tau)$; (2) training lag-conditioned discriminators $D_{1,\theta}, D_{2,\theta}, D_{12,\theta}$ to estimate $\hat{P}$, together with MLPs to generate embeddings for $\hat{q}$ at each $\tau$; (3) updating the alignment tensor to enforce marginal distribution matching between $Q_{\tau}$ and $\hat{P}$, yielding the optimized distribution $Q_{\tau}^*$; and (4) decomposing the resulting estimates of $Q_{\tau}^*$ and $\hat{P}$ into redundancy, uniqueness, and synergy sequences across all time lags.
  • Figure 4: Model Structure of RUS-Aware Router.
  • Figure 5: Insights from temporal RUS values across different applications include: (a) insulin and furosemide exhibit a strong synergistic effect at the time of administration, while insulin’s unique effect becomes more pronounced later; (b) as time progresses after furosemide administration, its physiological impact increases; (c) in activity recognition, chest and hand motion display coupled movements during locomotion, reflecting strong redundancy; and (d) in physiological monitoring, ECG and respiration signals from one second prior provide better predictions of current chest temperature, capturing the natural response delay to stimuli.
  • ...and 3 more figures
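
The routing idea in the Figure 1 caption can be illustrated with a toy sketch. This is not the paper's router: the function names, the hand-set weight matrix, and the input values are all hypothetical, and the auxiliary losses and GRU mentioned in the TL;DR are omitted. It only shows how concatenating per-token interaction cues $(R, U_1, U_2, S)$ to the token embedding could steer tokens with similar interaction profiles to the same expert.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, rus, weights, top_k=1):
    """Hypothetical interaction-aware gate (an assumption, not the paper's design).

    token   : token embedding, list of floats
    rus     : interaction cues (R, U1, U2, S) for this token
    weights : one row per expert over the concatenated [token, rus] features
    Returns the top_k (expert_index, gate) pairs, gates renormalized to sum to 1.
    """
    features = list(token) + list(rus)
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)
    return [(e, probs[e] / norm) for e in chosen]


# Toy, hand-set weights: 2 token dimensions followed by 4 RUS dimensions.
# Expert i responds only to RUS cue i; a real router would learn these.
WEIGHTS = [
    [0.0, 0.0, 5.0, 0.0, 0.0, 0.0],  # expert 0: redundancy
    [0.0, 0.0, 0.0, 5.0, 0.0, 0.0],  # expert 1: uniqueness of modality 1
    [0.0, 0.0, 0.0, 0.0, 5.0, 0.0],  # expert 2: uniqueness of modality 2
    [0.0, 0.0, 0.0, 0.0, 0.0, 5.0],  # expert 3: synergy
]

# Two tokens with high redundancy and one with high synergy (made-up values).
m2 = route([0.3, -0.1], [0.8, 0.1, 0.05, 0.05], WEIGHTS)
m3 = route([-0.4, 0.2], [0.7, 0.1, 0.1, 0.1], WEIGHTS)
m1 = route([0.1, 0.1], [0.1, 0.1, 0.1, 0.7], WEIGHTS)
print(m2, m3, m1)
```

With these toy weights, the two high-redundancy tokens land on the same expert while the high-synergy token goes elsewhere, mirroring the $m_2$/$m_3$ example in the caption. In a trained model the gate weights would be learned rather than fixed by hand.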

Theorems & Definitions (1)

  • Definition 1: Multi-Source Directed Information