MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria

TL;DR

MM-InstructEval addresses a gap in the evaluation of multimodal reasoning over combined vision-text contexts. It introduces a zero-shot evaluation framework covering 45 models, 16 datasets, 6 tasks, and 10 instructions. It proposes four metrics (Best Performance, Mean Relative Gain, Stability, and Adaptability) to assess model and instruction efficacy, robustness, and compatibility. The study finds that closed-source models often outperform open-source ones on challenging tasks, yet newer open architectures (e.g., Flan-T5-based models and Qwen-VL variants) are closing the gap, and that instruction design (notably QA formats and option-rich prompts) significantly affects results. These findings yield practical guidance for model selection and instruction engineering and establish benchmarks for future work on multimodal reasoning in MLLMs.

Abstract

The emergence of multimodal large language models (MLLMs) has triggered extensive research in model evaluation. While existing evaluation studies primarily focus on unimodal (vision-only) comprehension and reasoning capabilities, they overlook critical assessments of complex multimodal reasoning tasks that require integrated understanding of both visual and textual contexts. Such multimodal tasks present unique challenges, demanding sophisticated reasoning across multiple modalities and deep comprehension of multimodal contexts. In this paper, we present MM-InstructEval, a comprehensive evaluation framework that incorporates diverse metrics to assess model performance across various multimodal reasoning tasks with vision-text contexts. We conduct extensive zero-shot evaluations on 45 models (including 36 MLLMs) across 16 multimodal datasets, encompassing 6 distinct tasks using 10 different instructions. Our framework introduces multiple innovative metrics, including the 'Best Performance' metric to benchmark peak model capabilities, the 'Mean Relative Gain' metric to assess overall efficacy across models and instructions, the 'Stability' metric to measure robustness, and the 'Adaptability' metric to quantify the compatibility between models and instructions. Through comprehensive evaluation and analysis, we uncover several significant insights about model architectures, instruction formats, and their interactions in multimodal reasoning tasks. Our findings establish new benchmarks for assessing the reasoning capabilities of MLLMs and provide strategic guidance for future developments. To facilitate continued research and evaluation in this field, we release our framework and resources at https://github.com/declare-lab/MM-InstructEval, with an interactive leaderboard available at https://declare-lab.github.io/MM-InstructEval/.
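
The abstract names the metrics but does not give their formulas, so the following is only a minimal sketch of how three of them might be computed from a table of accuracies. The formulas are assumptions made for illustration, not the paper's definitions: relative gain is taken as the percent difference from the cross-model mean for the same instruction, and stability as the population standard deviation of a model's accuracy across instructions. The `scores` table, model names, and function names are hypothetical. Adaptability is left out because the abstract gives too little detail to sketch it.

```python
from statistics import mean, pstdev

# Hypothetical accuracy table: scores[model][instruction] = accuracy in percent.
scores = {
    "model_a": {"inst_1": 60.0, "inst_2": 70.0},
    "model_b": {"inst_1": 40.0, "inst_2": 30.0},
}


def best_performance(scores):
    """Peak accuracy per model, with the instruction that achieved it."""
    result = {}
    for model, by_inst in scores.items():
        inst = max(by_inst, key=by_inst.get)
        result[model] = (inst, by_inst[inst])
    return result


def mean_relative_gain(scores):
    """ASSUMED formula: average, over instructions, of the percent gain of a
    model's accuracy over the cross-model mean for that instruction."""
    instructions = list(next(iter(scores.values())))
    col_mean = {i: mean(scores[m][i] for m in scores) for i in instructions}
    return {
        m: mean((scores[m][i] - col_mean[i]) / col_mean[i] * 100 for i in instructions)
        for m in scores
    }


def stability(scores):
    """ASSUMED formula: population standard deviation of a model's accuracy
    across instructions (lower means less sensitive to the instruction)."""
    return {m: pstdev(by_inst.values()) for m, by_inst in scores.items()}


if __name__ == "__main__":
    print(best_performance(scores))
    print(mean_relative_gain(scores))
    print(stability(scores))
```

The same functions could be applied per instruction by transposing the table, which would give an instruction-level view of efficacy and robustness.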

Paper Structure (32 sections, 7 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Required capabilities for diverse datasets. The traditional capabilities, indicated above the dotted line, apply to multimodal reasoning tasks with vision-only contexts, including COCO DBLP:conf/eccv/LinMBHPRDZ14, VQA v2 DBLP:conf/eccv/LinMBHPRDZ14, Text VQA DBLP:conf/cvpr/SinghNSJCBPR19, OK-VQA DBLP:conf/cvpr/MarinoRFM19, and MM-VET DBLP:journals/corr/abs-2308-02490. In contrast, our multimodal reasoning tasks with vision-text contexts, particularly those below the dotted line, require not only the capabilities relevant to traditional tasks but also deep interaction with and understanding of complex vision-text contexts. In these tasks, 'T' represents the text context, 'Q' denotes the question prompting the models for answers, and 'GT' stands for the ground truth label. 'HE' and 'TE' correspond to the head entity and tail entity, respectively.
  • Figure 2: Overview of our MM-InstructEval framework, which evaluates popular MLLMs across various multimodal reasoning tasks with multimodal contexts using comprehensive metrics. As illustrated in the dashed box, we take the 'MVSA-Single' dataset from the 'MSC' task and use 'Instruction # 2' to evaluate a specific 'MLLM'. We then aggregate the results to assess the performance of models and instructions with a variety of metrics. A colorful 'Q' symbolizes 'Question', and its design varies according to the specific task. For more detailed visual representations and explanations, please refer to Figure \ref{Figure_2_text_instruction_for_different_tasks}.
  • Figure 3: Inference process of Multimodal Large Language Models (MLLMs) for AlgoPuzzleVQA employing varied multimodal instructions. We construct instructions based on these formats, which comprise mandatory components (Task name, Task definition, Output format, and Question) as well as optional components (such as Context and Options). Furthermore, each format incorporates Specific instruction trigger words customized for the respective instruction. Note that only the text context is provided as input to the Large Language Models (LLMs).
  • Figure 4: Inference process of Multimodal Large Language Models (MLLMs) for MSD employing varied multimodal instructions.
  • Figure 5: Details of the various components of multimodal instructions for different tasks, such as Multimodal Sentiment Analysis (MSA), Multimodal Aspect-Based Sentiment Analysis (MABSA), Multimodal Hateful Memes Detection (MHMD), Multimodal Sarcasm Detection (MSD), and Multimodal Relation Extraction (MRE).
  • ...and 3 more figures
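
The caption of Figure 3 lists the components of an instruction: mandatory Task name, Task definition, Output format, and Question, plus optional Context and Options, followed by instruction-specific trigger words. A minimal sketch of assembling such a text prompt is shown below. The function name `build_instruction`, the field labels, the component order, the default trigger string, and the example values are all assumptions made for illustration; the paper's ten instruction formats are not reproduced here.

```python
from typing import Optional, Sequence


def build_instruction(
    task_name: str,
    task_definition: str,
    output_format: str,
    question: str,
    context: Optional[str] = None,
    options: Optional[Sequence[str]] = None,
    trigger: str = "Answer:",  # hypothetical trigger words
) -> str:
    """Join mandatory and optional components into one text prompt.

    The layout (one labelled component per line) is an assumption, not
    one of the paper's instruction formats.
    """
    parts = [
        f"Task name: {task_name}",
        f"Task definition: {task_definition}",
        f"Output format: {output_format}",
    ]
    if context:  # optional component
        parts.append(f"Context: {context}")
    parts.append(f"Question: {question}")
    if options:  # optional component, lettered (A), (B), ...
        lettered = " ".join(
            f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)
        )
        parts.append(f"Options: {lettered}")
    parts.append(trigger)
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        build_instruction(
            task_name="Multimodal Sarcasm Detection",
            task_definition="Decide whether the image-text pair is sarcastic.",
            output_format="Reply with one option letter.",
            question="Is the post sarcastic?",
            context="What a great day to be stuck in traffic.",
            options=["yes", "no"],
        )
    )
```

For an MLLM, the image would be passed alongside this text through the model's own interface; for a text-only LLM, only the text prompt is used, as the caption notes.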