CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

Yuefei Chen, Jiang Liu, Xiaodong Lin, Ruixiang Tang

TL;DR

This work targets robust counterfactual reasoning in video-based vision-language models. It introduces CounterVQA, a large benchmark with three progressive levels of counterfactual complexity and two interaction types, built on explicit causal graphs via a multi-agent graph-generation pipeline. To address observed gaps, the authors propose CFGPT, a two-stage post-training framework that transfers textual causal reasoning to video grounding and reinforces it with visual-causal alignment rewards, achieving substantial gains over baselines across all levels. The results highlight that specialized, causally informed training beats mere model scaling and pave a path toward more reliable, causally aware AI systems for dynamic real-world video understanding.
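To make the three levels concrete, here is a minimal sketch of how a question could be assigned a level from an explicit causal graph. It is an illustration under stated assumptions, not the authors' pipeline: the event names, the `CAUSAL_EDGES` graph, and the helpers `causal_distance` and `question_level` are all hypothetical. The level definitions follow the paper's figure captions: adjacent (one hop), long-chain (multi-hop), and non-existent event.

```python
from collections import deque

# Hypothetical toy causal graph: an edge u -> v means "u causally enables v".
# Event names are invented for illustration and are not taken from CounterVQA.
CAUSAL_EDGES = {
    "pick up kettle": ["fill kettle"],
    "fill kettle": ["boil water"],
    "boil water": ["pour water into cup"],
    "pour water into cup": [],
}


def causal_distance(edges, cause, effect):
    """Return the number of causal hops from cause to effect, or None if unreachable."""
    queue = deque([(cause, 0)])
    seen = {cause}
    while queue:
        node, hops = queue.popleft()
        if node == effect:
            return hops
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


def question_level(edges, intervened_event, queried_event):
    """Assign a toy difficulty level (1, 2, or 3) to a counterfactual question."""
    if intervened_event not in edges:
        return 3  # the hypothetical event never occurred in the observed video
    hops = causal_distance(edges, intervened_event, queried_event)
    if hops == 1:
        return 1  # adjacent: direct causal dependency
    if hops is not None and hops > 1:
        return 2  # long-chain: multi-hop causal path
    return None  # no forward causal path, so no question of this form


print(question_level(CAUSAL_EDGES, "fill kettle", "boil water"))
print(question_level(CAUSAL_EDGES, "pick up kettle", "pour water into cup"))
print(question_level(CAUSAL_EDGES, "kettle is unplugged", "boil water"))
```

In this toy setup, a question such as "What if the person had not filled the kettle?" about the boiling step is Level 1, the same intervention on the first event queried against the last is Level 2, and an intervention on an event absent from the graph is Level 3.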

Abstract

Vision-language models (VLMs) have recently shown significant advances in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, that is, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. The dataset and code will be released.

Paper Structure

This paper contains 48 sections, 10 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: The left panel illustrates typical wrong cases where current VLMs misinterpret causal relations when answering counterfactual video questions. The right panel presents our dataset generation pipeline: a multi-agent system infers pairwise causal relations to build causal graphs for each video, ranks videos by causal graph complexity, and generates three levels of counterfactual questions (adjacent, long-chain, and non-existent event inference) for systematic evaluation.
  • Figure 2: Representative examples from the three difficulty levels of CounterVQA. Level 1: Adjacent counterfactual inference requires reasoning about direct causal dependencies between consecutive events. Level 2: Long-chain counterfactual inference involves tracing multi-hop causal relationships across several actions. Level 3: Counterfactual inference with non-existent events demands reasoning about hypothetical scenarios that did not occur in the observed video.
  • Figure 3: Overview of the CFGPT framework. Top: Two-stage pipeline from a base VLM to the final CFGPT model. Left (orange): Cross-modal causal transfer through supervised fine-tuning using V+A generated CoT data. Right (green): Visual-causal reinforcement via GRPO, where candidate outputs are scored by causal graph consistency and visual grounding rewards to refine counterfactual reasoning.
  • Figure 4: Observer Agent Prompt
  • Figure 5: Verifier Agent Prompt
  • ...and 6 more figures
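
The Figure 3 caption says that candidate outputs are scored by causal graph consistency and visual grounding rewards and refined via GRPO. The sketch below shows only the group-relative advantage step that GRPO is generally known for: each candidate's reward is normalized against the mean and standard deviation of its own sampling group. The reward weighting in `combined_reward`, the example scores, and the use of the population standard deviation are assumptions for illustration and are not taken from the paper.

```python
import statistics


def combined_reward(graph_score, grounding_score, graph_weight=0.5):
    """Blend two scalar rewards; the linear weighting is a hypothetical choice."""
    return graph_weight * graph_score + (1.0 - graph_weight) * grounding_score


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled candidates.

    Each advantage is (reward - group mean) / (group std + eps), so candidates
    are compared only against other samples for the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical candidate answers for one video question:
# (causal graph consistency score, visual grounding score), both invented.
candidate_scores = [(1.0, 0.8), (0.0, 0.6), (1.0, 0.2), (0.0, 0.0)]
rewards = [combined_reward(g, v) for g, v in candidate_scores]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Candidates that score above their group's mean receive a positive advantage and are reinforced, while those below it are penalized. If every candidate receives the same reward, all advantages are zero and the group contributes no learning signal.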