Towards Robust Multi-Modal Reasoning via Model Selection

Xiangyan Liu, Rongxue Li, Wei Ji, Tao Lin

TL;DR

The $\textit{M}^3$ framework improves model selection and bolsters the robustness of multi-modal agents in multi-step reasoning. It selects models dynamically, taking into account both user inputs and subtask dependencies, which makes the overall reasoning process more robust.

Abstract

The reasoning capabilities of large language models (LLMs) are widely acknowledged in recent research, inspiring studies on tool learning and autonomous agents. The LLM serves as the "brain" of the agent, orchestrating multiple tools for collaborative multi-step task solving. Unlike methods that invoke tools such as calculators or weather APIs for straightforward tasks, multi-modal agents excel by integrating diverse AI models to tackle complex challenges. However, current multi-modal agents neglect the significance of model selection: they focus primarily on the planning and execution phases and invoke only predefined task-specific models for each subtask, making the execution fragile. Meanwhile, traditional model selection methods are either incompatible with or suboptimal for multi-modal agent scenarios, because they ignore the dependencies among subtasks that arise from multi-step reasoning. To this end, we identify the key challenges therein and propose the $\textit{M}^3$ framework as a plug-in with negligible runtime overhead at test time. This framework improves model selection and bolsters the robustness of multi-modal agents in multi-step reasoning. In the absence of suitable benchmarks, we create MS-GQA, a new dataset specifically designed to investigate the model selection challenge in multi-modal agents. Our experiments reveal that our framework enables dynamic model selection that considers both user inputs and subtask dependencies, thereby making the overall reasoning process more robust. Our code and benchmark: https://github.com/LINs-lab/M3.

Paper Structure (53 sections, 2 equations, 9 figures, 9 tables)

This paper contains 53 sections, 2 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Illustration of the multi-modal multi-step reasoning process and the three model selection paradigms within it. (a) shows how multi-modal agents use LLMs to decompose complex multi-modal tasks, resulting in a multi-step reasoning process in which each node corresponds to a simpler yet more specific subtask. (b) highlights that, compared to a robust model selector, simplistic model selection methods are more prone to producing wrong outcomes at intermediate subtask stages, which in turn affects the final reasoning result. Here, $\mathbf{m}_{(i)}$ indicates that the $i$-th model is selected, and the color indicates the corresponding subtask type. (c) numerically compares the outcomes of model selection methods from different paradigms in multi-modal reasoning.
  • Figure 2: Comparison of three model selection paradigms under various inputs. The model selection processes of the three paradigms, from left (simplistic) to right (subtask dependency-aware), become progressively more fine-grained. "Simplistic" is inflexible and can be considered input-agnostic. "Traditional" depends solely on the subtask type and the corresponding original input information for model selection. When inputs are similar, "Traditional" cannot provide model selections as diverse as those of "Subtask Dependency-Aware", which leverages differences in reasoning logic to offer more varied and suitable model choices. Note that node P (green circle) in the figure denotes Python module invocation, which does not entail model selection.
  • Figure 3: Illustration of $\textbf{M}^\textbf{3}$: (a) depicts the forward computation process: 1) Task Graph: An initial virtual node represents the multi-modal input. Specific models are assigned to each subtask node based on the respective subtask type. 2) Node Embedding: Features are extracted using the multi-modal encoder $\psi_1(\cdot)$ and embedding table $\psi_2(\cdot)$ for the initial virtual node and subtask nodes. 3) Computation Graph Learner: The computation graph, including node features and subtask dependencies (edges $\mathcal{E}_i$), serves as input to learner $\phi(\cdot)$, contributing to the predicted execution status $s_i^j$. (b) illustrates the process of ranking and selecting the model selection choice with a greater likelihood of success.
  • Figure 4: The proportions of various structural categories in GQA.
  • Figure 5: Performance comparison in scenarios with missing training data and varying time constraints at test-time. (a) and (b) depict two data-missing scenarios with progressively increasing proportions on the x-axis. (c) illustrates method performance across different time constraints.
  • ...and 4 more figures
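
The ranking step described for Figure 3(b), and the contrast with per-subtask selection described for Figure 2, can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: the task graph, the model names, the numeric scores, and the function `predict_success`, which is a hand-written stand-in for the learned computation graph learner $\phi(\cdot)$ and does not reproduce the authors' implementation.

```python
from itertools import product

# Hypothetical task graph: node 0 is the virtual input node, nodes 1..2 are subtasks.
SUBTASKS = {1: "detection", 2: "vqa"}   # node id -> subtask type
EDGES = [(0, 1), (1, 2)]                # subtask dependencies (parent, child)

# Hypothetical candidate models per subtask type.
CANDIDATES = {"detection": ["det_a", "det_b"], "vqa": ["vqa_a", "vqa_b"]}

# Toy numbers standing in for learned quantities (assumptions, not real results).
SOLO_SCORE = {"det_a": 0.6, "det_b": 0.7, "vqa_a": 0.8, "vqa_b": 0.5}
PAIR_BONUS = {("det_a", "vqa_a"): 0.3}  # models that work well in sequence


def predict_success(assignment):
    """Toy stand-in for the learner: an unnormalized score for one assignment.

    Averages per-model scores, then adds a bonus for every dependency edge
    whose two endpoints are assigned a compatible pair of models.
    """
    solo = sum(SOLO_SCORE[m] for m in assignment.values()) / len(assignment)
    bonus = sum(
        PAIR_BONUS.get((assignment[u], assignment[v]), 0.0)
        for u, v in EDGES
        if u in assignment and v in assignment
    )
    return solo + bonus


def enumerate_assignments():
    """List every way of picking one candidate model per subtask node."""
    nodes = sorted(SUBTASKS)
    pools = [CANDIDATES[SUBTASKS[n]] for n in nodes]
    return [dict(zip(nodes, combo)) for combo in product(*pools)]


def select_dependency_aware():
    """Rank whole-graph assignments by predicted score and keep the best."""
    return max(enumerate_assignments(), key=predict_success)


def select_per_subtask():
    """Baseline: pick the best model for each subtask in isolation."""
    return {n: max(CANDIDATES[t], key=SOLO_SCORE.get) for n, t in SUBTASKS.items()}


if __name__ == "__main__":
    print("dependency-aware:", select_dependency_aware())
    print("per-subtask:     ", select_per_subtask())
```

In this toy setup the per-subtask baseline picks `det_b` because it scores best in isolation, while the dependency-aware ranking picks `det_a` because it pairs better with the downstream model. Exhaustive enumeration is used only to keep the sketch short; it would not scale to large task graphs or candidate pools.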

Theorems & Definitions (3)

  • Definition 3.1: Model selection for each subtask type on a multi-modal task graph
  • Definition 3.2: Subtask dependency on a multi-modal task graph
  • Remark 3.3: Generalizing model selection beyond the subtask type