MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal

TL;DR

MEXA is a training-free framework that dynamically coordinates a pool of modality- and skill-specific experts through an MLLM router and an LRM-based aggregator to enable general multimodal reasoning. Each expert outputs a textual representation of its modality-specific reasoning, which the aggregator reasons over to produce the final answer, enabling transparent, scalable reasoning across diverse domains. Empirically, MEXA achieves consistent gains over strong baselines on Video-MMMU, MMAU, SQA3D, and M3D, underscoring the value of explicit expert selection and long-context reasoning in multimodal tasks. The approach avoids additional training overhead while maintaining broad applicability; ablations confirm the importance of the router and aggregator components and illustrate how the distribution of selected experts adapts across tasks.

Abstract

Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
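
The select-then-aggregate flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name below (`Pipeline`, `keyword_router`, `concat_aggregator`, and the three toy experts) is a hypothetical stand-in, the router is a keyword rule rather than an MLLM, and the aggregator concatenates text rather than calling an LRM. It only shows the shape of the control flow: pick experts, collect their textual outputs, and hand those outputs to a single aggregation step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of this is MEXA's real API.
Expert = Callable[[str, str], str]  # (task_context, question) -> textual output


@dataclass
class Pipeline:
    router: Callable[[str, str, List[str]], List[str]]  # returns chosen expert names
    experts: Dict[str, Expert]                          # pool of available experts
    aggregator: Callable[[str, Dict[str, str]], str]    # reasons over expert texts

    def answer(self, context: str, question: str) -> str:
        selected = self.router(context, question, list(self.experts))
        # Run only the selected experts; skip names that are not in the pool.
        outputs = {
            name: self.experts[name](context, question)
            for name in selected
            if name in self.experts
        }
        return self.aggregator(question, outputs)


def keyword_router(context: str, question: str, names: List[str]) -> List[str]:
    # Stand-in for the MLLM router: select an expert when its modality
    # prefix (the text before the first underscore) appears in the context.
    return [n for n in names if n.split("_")[0] in context]


def concat_aggregator(question: str, outputs: Dict[str, str]) -> str:
    # Stand-in for the LRM aggregator: join expert texts in a fixed order.
    return " | ".join(f"{name}: {text}" for name, text in sorted(outputs.items()))


# Toy experts that return fixed strings instead of running real models.
experts: Dict[str, Expert] = {
    "video_caption": lambda c, q: "a lecture on sorting",
    "audio_transcript": lambda c, q: "speaker explains quicksort",
    "3d_scene": lambda c, q: "a room with a desk",
}

pipeline = Pipeline(keyword_router, experts, concat_aggregator)
result = pipeline.answer("video+audio clip", "What is the topic?")
print(result)
```

Because every expert communicates through plain text, swapping an expert or adding a new modality in this sketch only changes the `experts` dictionary; the router and aggregator interfaces stay the same, which mirrors the modularity the abstract emphasizes.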

Paper Structure

This paper contains 28 sections, 5 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of the MEXA Architecture. Given the input task context and question, MEXA first employs an MLLM router to select the appropriate experts based on input modality and required reasoning skills. The aggregator then reasons over the outputs from the selected experts to generate the final answer.
  • Figure 2: Expert distributions selected by MEXA across different benchmarks, covering video (Video-MMMU), audio (MMAU), 3D (SQA3D), and medical imaging (M3D).
  • Figure 3: A qualitative example of Video-MMMU.
  • Figure 6: Prompts for expert selection.