SMiR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning
Andrew Li, Rahul Thapa, Rahul Chalamala, Qingyang Wu, Kezhen Chen, James Zou
TL;DR
This work addresses a bottleneck in open-source multi-image reasoning by introducing SMiR, a synthetic data pipeline that builds groups of highly correlated image-caption pairs and uses open-source LLMs to generate 160K instruction-tuning samples from them. A key component is a multimodal embedding, defined as $E_{multimodal} = E_{image} + c \cdot E_{caption}$, where $c$ is a scalar weight on the caption embedding. The embedding is combined with clustering and iterative sampling to select coherent but diverse image sets for prompting. The authors also introduce SMiR-Bench, a 200-example, multi-turn benchmark spanning seven reasoning tasks and scored by a VLM judge, which allows evaluation of free-form responses rather than multiple-choice answers only. Fine-tuning open-source VLMs on SMiR yields gains of up to 8% on SMiR-Bench, indicating practical value for open-source multi-image reasoning. The results also illustrate the remaining gap relative to closed-source models and point to future work on scalability and generalization.
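The embedding formula and the idea of iterative sampling can be made concrete with a minimal sketch. This is not the authors' implementation: the function names, the default `c=0.5`, the assumption that image and caption embeddings share a dimension, and the distance-weighted sampling rule are all illustrative choices, and the clustering step is omitted for brevity.

```python
import numpy as np


def combine_embeddings(image_emb, caption_emb, c=0.5):
    """Compute E_multimodal = E_image + c * E_caption row by row.

    Assumes both inputs have the same shape (n_items, dim); the value of
    c here is a placeholder, not one taken from the paper.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    caption_emb = np.asarray(caption_emb, dtype=float)
    if image_emb.shape != caption_emb.shape:
        raise ValueError("image and caption embeddings must share a shape")
    return image_emb + c * caption_emb


def sample_group(embeddings, seed_index, group_size, rng):
    """Grow a group one item at a time (hypothetical sampling rule).

    At each step, every remaining item gets a probability that decays with
    its distance to the centroid of the items already chosen, so groups
    stay coherent while still allowing some variety.
    """
    group_size = min(group_size, len(embeddings))
    chosen = [seed_index]
    while len(chosen) < group_size:
        remaining = [i for i in range(len(embeddings)) if i not in chosen]
        centroid = embeddings[chosen].mean(axis=0)
        dists = np.linalg.norm(embeddings[remaining] - centroid, axis=1)
        # Shift by the minimum distance so the closest item has weight 1
        # and the weights never all underflow to zero.
        weights = np.exp(-(dists - dists.min()))
        probs = weights / weights.sum()
        chosen.append(int(rng.choice(remaining, p=probs)))
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(6, 4))    # stand-in image embeddings
    caption_emb = rng.normal(size=(6, 4))  # stand-in caption embeddings
    multimodal = combine_embeddings(image_emb, caption_emb, c=0.5)
    print(sample_group(multimodal, seed_index=0, group_size=3, rng=rng))
```

In a fuller pipeline, sampling of this kind would presumably run inside each cluster rather than over the whole collection, so that every group starts from items that are already related.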
Abstract
Vision-Language Models (VLMs) excel at understanding single images, aided by high-quality instruction datasets. However, multi-image reasoning remains underexplored in the open-source community due to two key challenges: (1) scaling datasets with correlated images and complex reasoning instructions is resource-intensive, and (2) robust evaluation benchmarks for multi-image tasks are lacking. To address this, we introduce SMiR, a synthetic data-generation pipeline for multi-image reasoning, along with a high-quality dataset generated using this pipeline. SMiR efficiently extracts correlated images via multimodal embeddings, integrates visual and descriptive information, and leverages open-source LLMs to generate quality instructions. Using this approach, we produce 160K synthetic training samples, offering a cost-effective alternative to closed-source solutions. Additionally, we present SMiR-Bench, a multi-image reasoning benchmark comprising 200 diverse examples across seven complex reasoning tasks. SMiR-Bench is multi-turn and employs a VLM judge to evaluate free-form responses, providing a comprehensive assessment of model expressiveness and reasoning capability across modalities. We demonstrate the effectiveness of SMiR by fine-tuning open-source VLMs and evaluating them on SMiR-Bench.
