Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

Dhita Putri Pratama; Soyeon Caren Han; Yihao Ding

TL;DR

Experiments on state-of-the-art Large Vision-Language Models (LVLMs) show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning, suggesting that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.

Abstract

Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess answer correctness, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.

Paper Structure (11 sections, 1 equation, 4 figures, 2 tables)

Introduction
ViLCaR
Vision-Language Causal Graphs (VLCGs)
ViLCaR Tasks
Dataset Construction
Data Statistics
Experiment Setup
Results
Overall Performance
Qualitative Analysis
Conclusion

Figures (4)

Figure 1: Example of a VLCG. Given an image-question pair (“Have these people just married?”), the graph encodes causally relevant objects (e.g., persons, cake), attributes (wedding dress, suit), relations (wear), and scene-grounded assumptions linking visual evidence to the conclusion. Unlike scene graphs, VLCGs capture question-conditioned causal relevance rather than complete perceptual structure.
Figure 2: Three diagnostic tasks in ViLCaR derived from the verified and pruned VLCGs: Causal Attribution (CA), Causal Inference (CI), and Question Answering (QA).
Figure 3: Summary statistics of the VLCGs. The most frequent object is 'person', the most frequent object characteristics are mental states (e.g., facial expression), and the most frequent object relations are physical relationships (e.g., hold).
Figure 4: Reasoning outputs from the Qwen2.5-VL 7B model under three settings: (1) Zero-shot, (2) Standard ICL, and (3) VLCG-augmented. Compared to the baselines, the model with ViLCaR injection identifies information about the roles of the people that is relevant to the question.
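
To make the Figure 1 example concrete, a VLCG can be pictured as a small typed container for the question together with its relevant objects, attributes, relations, and assumptions. The Python sketch below is illustrative only: the class names `VLCG` and `Relation`, the field layout, the specific edges, the assumption wording, and the `to_text` serialization are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A directed edge: subject --predicate--> target (hypothetical layout)."""
    subject: str
    predicate: str
    target: str


@dataclass
class VLCG:
    """Query-conditioned graph holding only elements relevant to the question."""
    question: str
    objects: list[str] = field(default_factory=list)
    attributes: list[str] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the graph as plain text (this format is an assumption)."""
        lines = [
            f"Question: {self.question}",
            "Objects: " + ", ".join(self.objects),
            "Attributes: " + ", ".join(self.attributes),
            "Relations: " + "; ".join(
                f"{r.subject} {r.predicate} {r.target}" for r in self.relations
            ),
            "Assumptions: " + "; ".join(self.assumptions),
        ]
        return "\n".join(lines)


# Values follow the Figure 1 caption; the edges and the assumption
# sentence are illustrative guesses, not the authors' annotations.
example = VLCG(
    question="Have these people just married?",
    objects=["person", "cake"],
    attributes=["wedding dress", "suit"],
    relations=[
        Relation("person", "wear", "wedding dress"),
        Relation("person", "wear", "suit"),
    ],
    assumptions=["Wedding attire and a cake suggest a wedding celebration."],
)

print(example.to_text())
```

A text form like this is one way structured relevance information could be placed in a prompt; the serialization actually used in the experiments is not described on this page.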
