R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Zhongyang Li; Ziyue Li; Tianyi Zhou

R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Zhongyang Li, Ziyue Li, Tianyi Zhou

TL;DR

R2-T2 introduces a training-free test-time re-routing mechanism for multimodal MoE models, aiming to correct suboptimal routing weights without retraining. It formulates three strategies—Neighborhood Gradient Descent, Kernel Regression, and Mode Finding—that adapt routing using a reference set of correctly predicted samples in a task-embedding space. Across MoAI-7B and MoVA-7B, R2-T2 achieves substantial gains on eight benchmarks, approaching oracle upper bounds while incurring moderate computation. This approach demonstrates that dynamic, test-time routing refinement can significantly improve cross-modal reasoning without expensive fine-tuning, enhancing robustness and generalization for smaller, efficient MoE models.

Abstract

In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)' powerful reasoning capabilities, deterring LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method "Re-Routing in Test-Time (R2-T2)" that locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.

R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

TL;DR

Abstract

R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (14)