R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Zhongyang Li, Ziyue Li, Tianyi Zhou

TL;DR

R2-T2 introduces a training-free test-time re-routing mechanism for multimodal MoE models, aiming to correct suboptimal routing weights without retraining. It formulates three strategies—Neighborhood Gradient Descent, Kernel Regression, and Mode Finding—that adapt routing using a reference set of correctly predicted samples in a task-embedding space. Across MoAI-7B and MoVA-7B, R2-T2 achieves substantial gains on eight benchmarks, approaching oracle upper bounds while incurring moderate computation. This approach demonstrates that dynamic, test-time routing refinement can significantly improve cross-modal reasoning without expensive fine-tuning, enhancing robustness and generalization for smaller, efficient MoE models.

Abstract

In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the powerful reasoning capabilities of large language models (LLMs), which limits LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method "Re-Routing in Test-Time (R2-T2)" that locally optimizes the vector of routing weights at test time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.
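
To make the idea of moving a routing-weight vector toward those of correctly predicted neighbors concrete, here is a minimal sketch of a kernel-weighted average over nearest neighbors. It is an illustration under stated assumptions, not the authors' implementation: the function name `kernel_regression_reroute`, the Euclidean distance, the Gaussian kernel, and the parameters `k` and `bandwidth` are all hypothetical choices made for this example.

```python
import math


def kernel_regression_reroute(test_emb, ref_embs, ref_weights, k=3, bandwidth=1.0):
    """Estimate routing weights for a test sample from its k nearest references.

    test_emb    : embedding of the test sample (list of floats)
    ref_embs    : embeddings of correctly predicted reference samples
    ref_weights : routing-weight vectors of those reference samples
    All names and defaults here are assumptions for illustration.
    """
    # Euclidean distance from the test embedding to every reference embedding.
    dists = [math.dist(test_emb, emb) for emb in ref_embs]
    # Indices of the k closest reference samples.
    nearest = sorted(range(len(ref_embs)), key=lambda i: dists[i])[:k]
    # Gaussian kernel (an assumed choice): closer neighbors get larger weight.
    kern = [math.exp(-(dists[i] ** 2) / (2 * bandwidth ** 2)) for i in nearest]
    total = sum(kern)
    n_experts = len(ref_weights[0])
    # Kernel-weighted average of the neighbors' routing-weight vectors.
    return [
        sum(kw * ref_weights[i][j] for kw, i in zip(kern, nearest)) / total
        for j in range(n_experts)
    ]
```

Because the result is a convex combination of the neighbors' routing weights, it stays on the probability simplex whenever the reference vectors do.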

Paper Structure

This paper contains 25 sections, 9 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: R2-T2 applied to MoAI-7B compared against 7/8/13B VLMs on 9 benchmarks. R2-T2 significantly enhances performance of the 7B base MoE model, surpassing a recent 13B VLM.
  • Figure 2: An example of how R2-T2 optimizes the routing weights. Given the test sample, it finds the $k$ nearest neighbors ($k$NN) in the reference set of correctly predicted samples with similar questions. In the example, the test sample requires reasoning about positional relationships. R2-T2 identifies relevant $k$NN samples, adjusting the top-1 expert from $\mathbf{I}_{\textsc{lang}}$ (aligning visual features with language) to $\mathbf{I}_{\textsc{aux}}$ (aligning visual features with auxiliary computer vision features). This expert shift is crucial in correcting the initial wrong answer.
  • Figure 3: Illustration of R2-T2's test-time re-routing mechanism with three strategies. (a) Neighborhood Gradient Descent: Optimizes $r$ using gradients derived from neighbors' loss functions ($\nabla_{r} l_1$, $\nabla_{r} l_2$, and $\nabla_{r} l_3$ for the 3 nearest neighbors), weighted by their similarity to the test sample. (b) Kernel Regression: Estimates $r$ as a weighted average of neighbors' routing weights ($\hat{r}$), and further optimizes it through binary search between $\hat{r}$ and initial weights $r$ to find the optimal coefficient $\alpha$. (c) Mode Finding: Iteratively updates $r$ through weighted interpolation between current weights and the local average $\bar{r}$ in routing weight space, shifting towards the densest region.
  • Figure 4: Top-1 expert transitions to correct/incorrect predictions on CVBench2D/3D after re-routing. The primary transitions to correct predictions in (a) include $\mathbf{I}_{\textsc{LANG}}$ to $\mathbf{L}_{\textsc{IMG}}$, $\mathbf{L}_{\textsc{AUX}}$ and $\mathbf{L}_{\textsc{AUX}}$. The primary transitions to incorrect predictions in (b) include $\mathbf{I}_{\textsc{LANG}}$ to $\mathbf{I}_{\textsc{AUX}}$, $\mathbf{L}_{\textsc{IMG}}$ and $\mathbf{L}_{\textsc{AUX}}$. R2-T2 considerably mitigates the modality imbalance of the base model.
  • Figure 5: Transition between correct and incorrect predictions on CVBench2D/3D during NGD steps of R2-T2 from Step 0 to 10. NGD keeps turning more incorrect predictions to correct.
  • ...and 9 more figures
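
The Mode Finding strategy described in the Figure 3 caption (interpolating between the current weights $r$ and a local average $\bar{r}$ in routing-weight space) can be sketched roughly as a mean-shift-style loop. This is a hedged illustration only: the function name `mode_finding_reroute`, the Gaussian kernel, the Euclidean distance, and the parameters `k`, `step`, `bandwidth`, and `n_iters` are assumptions introduced for this example and are not taken from the paper.

```python
import math


def mode_finding_reroute(r, ref_weights, k=3, step=0.5, bandwidth=1.0, n_iters=10):
    """Shift routing weights r toward a dense region of reference routing weights.

    r           : initial routing-weight vector of the test sample
    ref_weights : routing-weight vectors of correctly predicted reference samples
    All names and defaults here are assumptions for illustration.
    """
    r = list(r)
    for _ in range(n_iters):
        # Neighbors are searched in routing-weight space, not embedding space.
        dists = [math.dist(r, w) for w in ref_weights]
        nearest = sorted(range(len(ref_weights)), key=lambda i: dists[i])[:k]
        # Gaussian kernel weights (an assumed kernel choice).
        kern = [math.exp(-(dists[i] ** 2) / (2 * bandwidth ** 2)) for i in nearest]
        total = sum(kern)
        # Local kernel-weighted average r_bar of the neighbors' routing weights.
        r_bar = [
            sum(kw * ref_weights[i][j] for kw, i in zip(kern, nearest)) / total
            for j in range(len(r))
        ]
        # Interpolate between the current weights and the local average.
        r = [(1 - step) * rj + step * bj for rj, bj in zip(r, r_bar)]
    return r
```

With `step=1.0` each iteration jumps straight to the local average; smaller values move more conservatively toward it.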