CauSight: Learning to Supersense for Visual Causal Discovery

Yize Zhang, Meiqi Chen, Sirui Chen, Bo Peng, Yanxi Zhang, Tianyu Li, Chaochao Lu

TL;DR

The paper defines the task of visual causal discovery and introduces CauSight, a vision-language model trained on the large-scale Visual Causal Graph dataset (VCG-32K) to infer entity-level causal graphs from images. It proposes Tree-of-Causal-Thought (ToCT), which synthesizes reasoning trajectories from region, entity, and causality actions and uses Monte Carlo Tree Search to explore them. Training then proceeds with supervised fine-tuning, followed by reinforcement learning (GRPO) with a graph-based causal reward. CauSight achieves substantial improvements over strong baselines, including GPT-4.1, and generalizes across domains to Objects365, demonstrating the value of causally grounded reasoning in visual understanding. The work also provides a detailed dataset, training recipe, and ablations, emphasizing the importance of structured reasoning and causal priors for scalable, interpretable visual reasoning in real-world scenarios.

Abstract

Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model that performs visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at the project page: https://github.com/OpenCausaLab/CauSight.

Paper Structure

This paper contains 32 sections, 10 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: A comparison between VLMs that understand (a) a scene graph, which only specifies spatial relations between entities, and (b) a causal graph, which captures causal mechanisms between entities. Genuine reasoning and safe deployment require VLMs to discover causal relations between entities.
  • Figure 2: The two-stage annotation pipeline of VCG-32K: bounding box refinement and causal relationship labeling.
  • Figure 3: Illustration of a single synthesized reasoning trajectory. The teacher model can repeatedly execute three key actions to extend the reasoning trajectory.
  • Figure 5: Detection-dependent performance stability. Graph recall is evaluated under varying GIoU thresholds, summarized by the Recall Stability Index (RSI). CauSight achieves strong causal discovery performance while maintaining high RSI, indicating a balanced integration of detection and reasoning capabilities.
  • Figure 6: Model generalizability across three OOD benchmarks. Qwen refers to Qwen2.5-VL-7B, and +SFT to the SFT variant used as a baseline. Each cell reports the model’s accuracy.
  • ...and 9 more figures
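
The Figure 5 caption mentions graph recall evaluated under varying GIoU thresholds. As a rough illustration of what such a metric could look like, the sketch below computes the standard Generalized IoU between two boxes and a thresholded edge recall. The matching rule (a ground-truth edge counts as recalled when some predicted edge has both its cause box and its effect box at or above the threshold), the function names, and the data layout are assumptions made for this example, not the paper's evaluation code. The Recall Stability Index is not computed here because its definition is not given in this summary.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout
Edge = Tuple[str, str]                   # (cause entity id, effect entity id)


def giou(a: Box, b: Box) -> float:
    """Generalized IoU of two axis-aligned boxes; ranges over [-1, 1]."""
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    if union <= 0.0 or hull <= 0.0:
        return 0.0  # degenerate boxes: treated as no overlap (a choice of this sketch)
    return inter / union - (hull - union) / hull


def graph_recall(
    gt_boxes: Dict[str, Box],
    gt_edges: List[Edge],
    pred_boxes: Dict[str, Box],
    pred_edges: List[Edge],
    threshold: float,
) -> float:
    """Fraction of ground-truth edges matched by some predicted edge.

    Assumed rule: both endpoints must reach the GIoU threshold.
    """
    if not gt_edges:
        return 0.0
    hits = 0
    for gt_cause, gt_effect in gt_edges:
        for pred_cause, pred_effect in pred_edges:
            cause_ok = giou(gt_boxes[gt_cause], pred_boxes[pred_cause]) >= threshold
            effect_ok = giou(gt_boxes[gt_effect], pred_boxes[pred_effect]) >= threshold
            if cause_ok and effect_ok:
                hits += 1
                break  # count each ground-truth edge at most once
    return hits / len(gt_edges)
```

Sweeping `threshold` over a range of values and recording `graph_recall` at each one gives a curve of the kind the caption describes: a model whose boxes are only loosely localized loses recall quickly as the threshold tightens, even when its causal edges are otherwise correct.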