From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

Hang Du, Jiayang Zhang, Guoshun Nan, Wendi Deng, Zhenyan Chen, Chenyang Zhang, Wang Xiao, Shan Huang, Yuqi Pan, Tao Qi, Sicong Leng

TL;DR

MIR tackles the challenge of joint reasoning over interleaved multi-image and text data to improve cross-modal understanding in MLLMs. It introduces a large-scale MIR dataset with 138,277 images and 22,257 QA pairs across 12 subtasks, accompanied by a five-step reasoning protocol (Summary, Caption, Text to region, Region to region, Conclusion) and a stage-wise curriculum that gradually increases task difficulty. The authors propose an adaptive difficulty filter and two-stage training to guide models from externally guided reasoning to autonomous reasoning, achieving consistent in-domain and out-of-domain gains across multiple open-source MLLMs. Experiments and case studies demonstrate that MIR enhances reasoning accuracy and promotes robust generalization, offering a practical benchmark for advancing multi-image interleaved reasoning in real-world cross-modal tasks.

Abstract

Multi-image Interleaved Reasoning aims to improve the ability of Multi-modal Large Language Models (MLLMs) to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. Current multi-image benchmarks overlook interleaved textual contexts and neglect the distinct relationships between individual images and their associated texts, yet enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and help them better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark, MIR, which requires joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs' ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an "easy to hard" approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models' reasoning performance on MIR and other established benchmarks. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs' capability to handle complex inter-modal tasks.

Paper Structure

This paper contains 17 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An example from MIR. This question aims to compare the sizes of the "truck" and "palm". Dashed box content is inferred by the model from context. Text2Region connects explicit textual elements ("truck" and "cola") and implicit objects ("human palm in image3", "truck in image1") to image regions. Region2Region establishes relationships between these regions, such as the cola and truck in image2, and the palm and cola in image3. By reasoning over the relationship between "palm", "Coca-Cola", and "truck", the correct size relationship is derived.
  • Figure 2: Overview of our proposed MIR Benchmark. It comprises $22,257$ questions derived from $138,277$ images, organized into three distinct categories—sequential ($5,455$), spatial ($9,350$), and analytical ($7,821$)—which are further divided into $12$ fine-grained tasks to rigorously evaluate multi-image interleaved reasoning.
  • Figure 3: Illustration of data construction. We collect and filter data from multiple sources to obtain raw images or text. For spatial tasks, we generate/synthesize images by our platform; for sequential tasks, we extract frames/calculate offsets from videos; for analytical tasks, we leverage web scraping/model editing to process images. The final dataset is created by annotating images and text through multiple rounds using LLMs and human effort.
  • Figure 4: Statistics of our MIR dataset. Panel (a) shows the categories of MIR along with their respective proportions. Panel (b) illustrates the total number of images as well as the number of images per group for each task type. Panel (c) depicts the average character count per annotation for each category.
  • Figure 5: Architecture of the proposed method. We use an expert model to split the dataset into easy and difficult samples. After fine-tuning on easy samples, we sample 40% of the difficult ones for multi-stage curriculum learning. First, we combine the question with summary, caption, text2region, and region2region as input to generate the conclusion. Next, we move region2region from the input to the target output, then text2region, followed by the caption, and finally the summary. The goal is for the model to autonomously generate a full reasoning process from the original question and arrive at the correct answer.
  • ...and 1 more figures
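
The staged schedule described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. The `Sample` class, the `expert_correct` flag (used here as a stand-in for the expert model's easy/difficult decision), and the plain-text prompt layout are all assumptions introduced for the example. Only the five step names, their order, and the 40% sampling rate come from the text above.

```python
import random
from dataclasses import dataclass

# The five reasoning steps, in the order named in the summary.
STEPS = ["summary", "caption", "text2region", "region2region", "conclusion"]


@dataclass
class Sample:
    # Hypothetical container; field names are illustrative assumptions.
    question: str
    steps: dict           # step name -> annotated reasoning text
    expert_correct: bool  # assumed proxy for the expert model's verdict


def split_by_difficulty(samples):
    """Assumed criterion: samples the expert answers correctly count as easy."""
    easy = [s for s in samples if s.expert_correct]
    hard = [s for s in samples if not s.expert_correct]
    return easy, hard


def sample_hard(hard, fraction=0.4, seed=0):
    """Draw a fixed fraction of the difficult samples for curriculum training."""
    rng = random.Random(seed)
    return rng.sample(hard, round(len(hard) * fraction))


def curriculum_stages(sample):
    """Yield (stage, prompt, target); each stage moves one more step
    from the guided input into the target the model must generate."""
    n_guided = len(STEPS) - 1  # the conclusion is always in the target
    for stage in range(n_guided + 1):
        given = STEPS[: n_guided - stage]
        to_generate = STEPS[n_guided - stage:]
        prompt = "\n".join(
            [sample.question] + [f"{n}: {sample.steps[n]}" for n in given]
        )
        target = "\n".join(f"{n}: {sample.steps[n]}" for n in to_generate)
        yield stage, prompt, target
```

At stage 0 the prompt carries the question plus the first four steps and the target is the conclusion alone. At the last stage the prompt is the bare question and the target is the full five-step reasoning chain, which matches the stated goal of autonomous reasoning.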