CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning

Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani, Paul Hongsuck Seo

Abstract

Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT spans diverse domains, including natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.

Paper Structure

This paper contains 45 sections, 24 figures, and 12 tables.

Figures (24)

  • Figure 1: Example from CRIT. The CRIT dataset is designed to train and evaluate cross-modal multi-hop reasoning over interleaved image-text contexts. Each example combines multiple textual and visual sources that must be jointly interpreted to answer a compositional question. The colored highlights in the text indicate distinct evidence segments contributing to the reasoning chain, while the numbered boxes in the images and text correspond to the multi-step inference process outlined in the Chain of Thoughts panel. Together, they illustrate how CRIT requires connecting dispersed clues across modalities to reach a grounded answer.
  • Figure 2: Overall Process of Cross-Modal Multi-Hop QA Generation. The procedure consists of three main stages. (1) Multimodal Content Graph Construction: images annotated with scene graphs are sampled (a), unique entities are filtered and merged (b), and text nodes connected with edges are generated via LLM prompting (c, d). (2) Textual Context Generation: subgraphs are extracted from the multimodal content graph (e) to produce complementary textual descriptions (f). (3) Question–Answer Generation: subgraphs are further sampled (g) to generate QA pairs requiring cross-modal multi-hop reasoning (h). Orange and yellow circular nodes represent visual entities originating from different images, while blue square nodes denote text nodes. Entities with identical names are distinguished by numerical subscripts (e.g., apple_1, apple_2). A minimal code sketch of this pipeline is given after the figure list.
  • Figure 3: Error categories and their distribution across 75 randomly sampled GPT-4o responses on CRIT.
  • Figure 4: Image Distribution
  • Figure 5: Text Distribution
  • ...and 19 more figures
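
The Figure 2 caption above describes the three-stage generation pipeline in prose. The Python sketch below illustrates that flow under stated assumptions: the classes and helper names (Node, ContentGraph, build_content_graph, sample_subgraph, generate_example) and the llm stub are hypothetical illustrations, not the authors' implementation, and the LLM call is replaced by a placeholder so the example runs offline.

```python
# Minimal, illustrative sketch of the three-stage pipeline in Figure 2.
# All class and function names here are hypothetical; the real CRIT pipeline
# prompts an actual LLM and operates on full scene-graph annotations.

import random
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str            # e.g. "apple_1" (duplicate entity names get numeric suffixes)
    modality: str        # "visual" or "text"
    source: str = ""     # originating image id for visual entities


@dataclass
class ContentGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (head, relation, tail) triples


def llm(prompt: str) -> str:
    """Stand-in for an LLM call; returns canned text so the sketch runs offline."""
    return f"[LLM output for: {prompt[:50]}...]"


def build_content_graph(scene_graphs: dict) -> ContentGraph:
    """Stages (a)-(d): merge sampled scene graphs and add bridging text nodes."""
    g, counts = ContentGraph(), {}
    for image_id, sg in scene_graphs.items():
        for entity in sg["entities"]:
            counts[entity] = counts.get(entity, 0) + 1
            g.nodes.append(Node(f"{entity}_{counts[entity]}", "visual", image_id))
        g.edges.extend(sg["relations"])
    # Connect visual entities from *different* images through generated text nodes.
    visual = [n for n in g.nodes if n.modality == "visual"]
    for a, b in zip(visual, visual[1:]):
        if a.source != b.source:
            text = llm(f"Write one sentence linking {a.name} and {b.name}")
            g.nodes.append(Node(text, "text"))
            g.edges += [(a.name, "described_in", text), (b.name, "described_in", text)]
    return g


def sample_subgraph(g: ContentGraph, k: int = 3) -> list:
    """Stages (e)/(g): pick a small node set as the basis for context or QA."""
    return random.sample(g.nodes, min(k, len(g.nodes)))


def generate_example(g: ContentGraph) -> dict:
    """Stages (f)/(h): produce complementary textual context and a multi-hop QA pair."""
    context = llm(f"Describe: {[n.name for n in sample_subgraph(g)]}")
    qa = llm(f"Ask a cross-modal multi-hop question over: "
             f"{[n.name for n in sample_subgraph(g)]}")
    return {"context": context, "question_answer": qa}


if __name__ == "__main__":
    toy = {
        "img_001": {"entities": ["apple", "table"],
                    "relations": [("apple", "on", "table")]},
        "img_002": {"entities": ["apple", "basket"],
                    "relations": [("apple", "in", "basket")]},
    }
    graph = build_content_graph(toy)
    print(generate_example(graph))
```

In this toy run, the bridging text node is what forces a question to hop between img_001 and img_002; this mirrors how CRIT's complementary textual contexts are meant to prevent answers from being inferred from a single modality alone.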