MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu

TL;DR

MIBench addresses the lack of evaluation of multimodal large language models (MLLMs) on multi-image inputs by introducing a large-scale benchmark covering three scenarios (Multi-Image Instruction, Multimodal Knowledge-Seeking, and Multimodal In-Context Learning), 13 tasks, and 13K annotated samples. It combines GPT-4-driven data generation, targeted distractor strategies, and quality control to build test items, and it scores models with circular multiple-choice evaluation and exact match for short answers. The results show that current MLLMs, particularly open-source ones, struggle with fine-grained perception, multi-image reasoning, and multimodal in-context learning, which points to a need for better multi-image pretraining, alignment, and in-context learning strategies. The data is publicly released to support further research on multi-image understanding in realistic multimodal settings.
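The two scoring protocols named in the summary can be sketched roughly as follows. This is an illustrative sketch, not the authors' evaluation code: the function names (`rotate`, `circular_correct`, `exact_match`), the `predict` callback, and the whitespace and case normalization are all assumptions. Circular evaluation is taken here in its usual sense, where a multiple-choice item counts as correct only if the model picks the right option under every rotation of the option order.

```python
from typing import Callable, List, Sequence


def rotate(options: Sequence[str], shift: int) -> List[str]:
    """Rotate the option list to the left by `shift` positions."""
    shift %= len(options)
    return list(options[shift:]) + list(options[:shift])


def circular_correct(
    question: str,
    options: Sequence[str],
    answer_idx: int,
    predict: Callable[[str, List[str]], int],
) -> bool:
    """Return True only if `predict` is right under every rotation.

    `predict` is a hypothetical stand-in for a model call: it receives the
    question and the (rotated) options and returns the chosen option index.
    """
    n = len(options)
    for shift in range(n):
        rotated = rotate(options, shift)
        # A left rotation by `shift` moves the correct option to this index.
        rotated_answer_idx = (answer_idx - shift) % n
        if predict(question, rotated) != rotated_answer_idx:
            return False
    return True


def exact_match(prediction: str, reference: str) -> bool:
    """Compare short answers after lowercasing and collapsing whitespace.

    The normalization is an assumption; the benchmark may normalize differently.
    """
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return normalize(prediction) == normalize(reference)


# Two toy predictors: one reads the options, one always picks position 0.
def content_predictor(question: str, options: List[str]) -> int:
    return options.index("cat")


def position_biased_predictor(question: str, options: List[str]) -> int:
    return 0
```

With these toy predictors, `circular_correct("Which animal?", ["cat", "dog", "bird", "fish"], 0, content_predictor)` holds, while the position-biased predictor fails as soon as the options are rotated. That is the point of the protocol: a model that favors a fixed option position gets no credit for lucky hits.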

Abstract

Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks. However, most existing MLLMs and benchmarks focus primarily on single-image input scenarios, leaving the performance of MLLMs on realistic multi-image inputs underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. In this paper, we propose a new benchmark, MIBench, to comprehensively evaluate the fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source and closed-source MLLMs on the proposed MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as limited fine-grained perception, multi-image reasoning, and in-context learning abilities. The annotated data of MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.

Paper Structure (23 sections, 4 figures, 6 tables)

This paper contains 23 sections, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Overview of our MIBench, which covers three multi-image scenarios and a total of 13 tasks.
  • Figure 2: Examples of the multi-image scenarios with a total of 13 tasks. The correct answers are marked in blue.
  • Figure 3: A qualitative case of the Subtle Difference task, where open-source MLLMs show inferior performance due to limited fine-grained perception ability.
  • Figure 4: Evaluation results on the Multimodal In-Context Learning scenario.