SMiR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning

Andrew Li; Rahul Thapa; Rahul Chalamala; Qingyang Wu; Kezhen Chen; James Zou

TL;DR

This work tackles the bottleneck of open-source multi-image reasoning by introducing SMiR, a synthetic data pipeline that creates highly correlated image-caption groups and generates 160K instruction-tuning samples using open-source LLMs. A key component is a multimodal embedding, defined as $E_{multimodal} = E_{image} + c \cdot E_{caption}$, which is combined with clustering and iterative sampling to select coherent but diverse image sets for prompting. The authors also introduce SMiR-Bench, a 200-example, multi-turn benchmark covering seven reasoning tasks and scored by a VLM judge, which allows free-form responses to be evaluated rather than only multiple-choice answers. Fine-tuning open-source VLMs on SMiR yields gains of up to 8% on SMiR-Bench, indicating practical value for improving open-source multi-image reasoning, while illustrating the remaining gap relative to closed-source models and pointing to future work on scalability and generalization.
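
The embedding formula above can be illustrated with a minimal Python sketch. Only the combination rule $E_{image} + c \cdot E_{caption}$ comes from the summary. The function names, the toy vectors, the value of `c`, and the inverse-distance weighting in `sample_group` are assumptions made for illustration; the paper's own clustering and sampling procedure is not reproduced here.

```python
import random
from math import sqrt


def combine_embeddings(image_emb, caption_emb, c=1.0):
    """Return E_image + c * E_caption, computed element-wise."""
    if len(image_emb) != len(caption_emb):
        raise ValueError("image and caption embeddings must share a dimension")
    return [i + c * t for i, t in zip(image_emb, caption_emb)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_group(embeddings, seed_index, group_size, rng):
    """Iteratively draw items, favouring those close to the seed.

    Assumption: weights are inversely proportional to the distance from
    the seed. This is a stand-in for illustration, not the paper's rule.
    """
    chosen = [seed_index]
    candidates = [i for i in range(len(embeddings)) if i != seed_index]
    while candidates and len(chosen) < group_size:
        weights = [
            1.0 / (1e-8 + euclidean(embeddings[seed_index], embeddings[i]))
            for i in candidates
        ]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen


# Toy 2-dimensional embeddings (hypothetical values).
image_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
caption_embs = [[0.5, 0.5], [0.4, 0.6], [0.1, 0.9]]
multimodal = [
    combine_embeddings(i, t, c=0.5) for i, t in zip(image_embs, caption_embs)
]
group = sample_group(multimodal, seed_index=0, group_size=2, rng=random.Random(0))
```

Because the draw is random rather than a strict nearest-neighbour lookup, nearby items are favoured without being guaranteed, which is one simple way to keep groups coherent while allowing some diversity.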

Abstract

Vision-Language Models (VLMs) excel at understanding single images, aided by high-quality instruction datasets. However, multi-image reasoning remains underexplored in the open-source community due to two key challenges: (1) scaling datasets with correlated images and complex reasoning instructions is resource-intensive, and (2) robust evaluation benchmarks for multi-image tasks are lacking. To address this, we introduce SMiR, a synthetic data-generation pipeline for multi-image reasoning, along with a high-quality dataset generated using this pipeline. SMiR efficiently extracts correlated images via multimodal embeddings, integrates visual and descriptive information, and leverages open-source LLMs to generate quality instructions. Using this approach, we produce 160K synthetic training samples, offering a cost-effective alternative to closed-source solutions. Additionally, we present SMiR-Bench, a multi-image reasoning benchmark comprising 200 diverse examples across seven complex reasoning tasks. SMiR-Bench is multi-turn and employs a VLM judge to evaluate free-form responses, providing a comprehensive assessment of model expressiveness and reasoning capability across modalities. We demonstrate the effectiveness of SMiR by fine-tuning open-source VLMs and evaluating them on SMiR-Bench.

Paper Structure (24 sections, 2 equations, 6 figures, 5 tables, 2 algorithms)

Introduction
SMiR: Synthetic Multi-Image Reasoning Data Pipeline
Multimodal Embedding Construction
Random Sampling with Iteration
Implementation
Multi-Image Benchmark
Benchmark Overview
Evaluation Methodology
Results
Conclusion
Related Works
Vision Language Models
Multi-Image Reasoning Data
Multi-Image Reasoning Benchmarks
Algorithm Details
...and 9 more sections

Figures (6)

Figure 1: Our end-to-end pipeline converts image-caption pairs into synthetic multi-turn conversations using multimodal embeddings, strategic sampling, and LLM prompting. The example, based on a sports scenario, illustrates how the pipeline generates contextually rich dialogues by leveraging visual relationships.
Figure 2: Comparison of an example from the MANTIS dataset, which includes samples from ShareGPT4V-PT where unrelated images such as animals and other objects are concatenated (top), vs. an example from the SMiR dataset, where only related images are grouped together (bottom), for multi-image reasoning.
Figure 3: Evaluation Benchmark (Storytelling) Using GPT-4o as Judge
Figure 4: Sampling based on distances between multimodal embeddings of image-caption pairs.
Figure 5: Images sampled from the same matched cluster often feature similar subjects or scenes.
...and 1 more figure
