CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models
Aneesh Komanduri, Karuna Bhaila, Xintao Wu
TL;DR
CausalVLBench introduces a formal benchmark for visual causal reasoning in large vision-language models (LVLMs). It comprises three tasks (causal structure inference, intervention target prediction, and counterfactual prediction) evaluated across three physically grounded datasets. The study systematically evaluates a broad set of open-source LVLMs in zero-shot and few-shot settings and finds that current models struggle with visual causal reasoning, especially with multi-image interventions and downstream causal propagation. Larger models such as Qwen2.5-VL and Gemini-2.0-Flash often excel at structure inference, while few-shot demonstrations and chain-of-thought prompting yield inconsistent benefits. The work highlights the need for new training paradigms to improve visual causal reasoning and provides a transparent, reproducible benchmark for future research.
Abstract
Large language models (LLMs) have shown remarkable ability in various language tasks, especially with their emergent in-context learning capability. Extending LLMs to incorporate visual inputs, large vision-language models (LVLMs) have shown impressive performance in tasks such as recognition and visual question answering (VQA). Despite increasing interest in the utility of LLMs in causal reasoning tasks such as causal discovery and counterfactual reasoning, there has been relatively little work showcasing the abilities of LVLMs on visual causal reasoning tasks. We take this opportunity to formally introduce a comprehensive causal reasoning benchmark for multi-modal in-context learning in LVLMs. Our benchmark, CausalVLBench, encompasses three representative tasks: causal structure inference, intervention target prediction, and counterfactual prediction. We evaluate state-of-the-art open-source LVLMs on these tasks across three causal representation learning datasets and demonstrate their fundamental strengths and weaknesses. We hope that our benchmark elucidates the drawbacks of existing vision-language models and motivates new directions and paradigms for improving the visual causal reasoning abilities of LVLMs.
