
MIBench: Evaluating LMMs on Multimodal Interaction

Yu Miao, Zequn Yang, Yake Wei, Ziheng Chen, Haotian Ni, Haodong Duan, Kai Chen, Di Hu

Abstract

Different multimodal scenarios require a model to integrate and utilize information across modalities in a specific way, depending on the demands of the task. These different ways of integrating modalities are referred to as "multimodal interaction". How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs). MIBench formulates each instance as a ($con_v$, $con_t$, $task$) triplet with contexts from vision and text, requiring LMMs to employ the correct form of multimodal interaction to complete the task. MIBench assesses models on three key aspects: the ability to source information from vision-centric cues, the ability to source information from text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels: Recognition, Understanding, and Reasoning. MIBench comprises over 10,000 vision-text context pairs spanning 32 distinct tasks. Evaluation of state-of-the-art LMMs shows that: (1) LMMs' multimodal interaction ability remains constrained, despite the scaling of model parameters and training data; (2) they are easily distracted by the textual modality when processing visual information; (3) they mostly possess a basic capacity for multimodal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability. We expect these observations to serve as a reference for developing LMMs with stronger multimodal ability in the future.

Paper Structure (27 sections, 2 equations, 25 figures, 2 tables)

Figures (25)

  • Figure 1: (Left) Different multimodal tasks require different types of modality interaction. Some tasks depend mostly on information from one modality (visual or textual context), while others require synergistic collaboration between them. (Middle) We introduce MIBench to systematically evaluate how well Large Multimodal Models (LMMs) handle multimodal interactions. It is structured around three fundamental interaction patterns (Vision-centric, Text-centric, and Synergy) across three cognitive levels (Recognition, Understanding, and Reasoning). (Right) Most current LMMs demonstrate limited capabilities in multimodal interaction (especially in synergy), struggling to selectively utilize cues from the centric modality and to achieve cross-modal collaboration.
  • Figure 2: Overview of the framework and samples. MIBench covers three interaction forms (Vision-centric, Text-centric, and Synergy) and three hierarchical levels of ability (Recognition, Understanding, and Reasoning). Each sample is composed of a textual context, a visual context, and a task (Q&A). For the vision-centric and text-centric samples, multiple types of context from the other modality are prepared in our evaluation, though only one is shown here for illustrative purposes.
  • Figure 3: Overview of MIBench sample formats. Each instance consists of a visual context, a textual context, and the corresponding task, formulated as ($con_v$, $con_t$, $task$). Vision- and text-centric tasks (Left and Middle) can be resolved by leveraging cues from the centric modality. We introduce various contexts from the other modality to evaluate the model's ability to selectively utilize cues from the target modality; these range from helpful contexts (e.g., correct guidance, concept visualization) to misleading guidance and unrelated content. For the synergy part (Right), the model is presented with one coupled visual-textual pair with complementary cues, necessitating effective cross-modal collaboration.
  • Figure 4: Overview of the sample annotation pipeline.
  • Figure 5: Performance of open-source LMMs across three progressive cognitive levels: Recognition (Level 1), Understanding (Level 2), and Reasoning (Level 3).
  • ...and 20 more figures
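
The ($con_v$, $con_t$, $task$) sample format described in Figures 2 and 3 can be sketched as a small data structure. This is a minimal illustration and not the authors' released code: the class names, field names, the `answer` field, and the use of exact-match accuracy as the score are all assumptions made here for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum


class Interaction(Enum):
    # The three interaction forms named in the paper.
    VISION_CENTRIC = "vision-centric"
    TEXT_CENTRIC = "text-centric"
    SYNERGY = "synergy"


class Level(Enum):
    # The three cognitive levels named in the paper.
    RECOGNITION = 1
    UNDERSTANDING = 2
    REASONING = 3


@dataclass(frozen=True)
class Sample:
    con_v: str                # visual context (assumed here to be an image path)
    con_t: str                # textual context
    task: str                 # the question posed to the model
    answer: str               # reference answer (hypothetical field)
    interaction: Interaction
    level: Level


def accuracy_by_cell(samples, predictions):
    """Exact-match accuracy per (interaction, level) cell.

    `predictions` is aligned with `samples` by index. Exact match is an
    assumed metric, chosen only to keep the sketch simple.
    """
    totals, correct = {}, {}
    for sample, prediction in zip(samples, predictions):
        key = (sample.interaction, sample.level)
        totals[key] = totals.get(key, 0) + 1
        correct[key] = correct.get(key, 0) + (prediction == sample.answer)
    return {key: correct[key] / totals[key] for key in totals}
```

Grouping scores by (interaction, level) mirrors the benchmark's three-by-three layout, so a weakness in, say, synergy at the Reasoning level is not averaged away by stronger cells.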