ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering

Paritosh Parmar, Eric Peh, Basura Fernando

TL;DR

This work tackles Causal-Why VideoQA with a modular two-stage framework that decouples video understanding from causal inference via natural language causal chains. A Causal Chain Extractor (CCE) identifies chain-level explanations from videos conditioned on questions, while a Causal Chain-Driven Answerer (CCDA) grounds answer choices in these chains; the two stages are trained separately to preserve causal fidelity. To enable supervised learning, the authors construct a large, human-verified causal-chain dataset (46,024 samples) and introduce CauCo, a causal-coherence metric for evaluating chain quality. Experiments across three datasets show strong performance gains and, crucially, improved explainability and trust (evaluated in part through human studies). The approach also generalizes well out of domain, suggesting that it can serve as a reusable causal-reasoning engine for diverse video understanding tasks.

Abstract

Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular paradigm that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that derives answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating accurate causal chains from existing datasets. We construct human-verified causal chains for 46K samples. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization, positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/
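The two-stage decoupling described in the abstract can be illustrated with a minimal interface sketch. This is not the authors' implementation: the names `Sample`, `ChainExtractor`, `ChainAnswerer`, `answer_with_chain`, `toy_extractor`, and `toy_answerer` are hypothetical, the video is reduced to an opaque identifier, and the sketch assumes for simplicity that the second stage sees only the question, the chain, and the answer options. The toy stages are placeholders (a fixed chain and a word-overlap scorer) that stand in for the learned CCE and CCDA.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Sample:
    """One multiple-choice Causal-Why item (hypothetical container)."""
    video_id: str          # opaque stand-in for the actual video
    question: str
    options: List[str]


# Assumed stage-1 interface: (video, question) -> causal chain as text steps.
ChainExtractor = Callable[[str, str], List[str]]
# Assumed stage-2 interface: (question, chain, options) -> chosen option index.
ChainAnswerer = Callable[[str, Sequence[str], Sequence[str]], int]


def answer_with_chain(
    sample: Sample, extractor: ChainExtractor, answerer: ChainAnswerer
) -> Tuple[int, List[str]]:
    """Run stage 1, then stage 2; return the choice and the chain behind it."""
    chain = extractor(sample.video_id, sample.question)
    choice = answerer(sample.question, chain, sample.options)
    # Returning the chain keeps the intermediate reasoning inspectable.
    return choice, chain


def toy_extractor(video_id: str, question: str) -> List[str]:
    """Placeholder for a learned extractor: always returns one fixed chain."""
    return ["the dog barks loudly", "the baby wakes up", "the baby cries"]


def toy_answerer(
    question: str, chain: Sequence[str], options: Sequence[str]
) -> int:
    """Placeholder for a learned answerer: pick the option sharing the most words with the chain."""
    chain_words = set(" ".join(chain).lower().split())
    scores = [len(chain_words & set(opt.lower().split())) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)


demo = Sample(
    video_id="clip_0001",  # hypothetical identifier
    question="Why is the baby crying?",
    options=[
        "the baby was hungry",
        "the dog barks and wakes the baby",
        "the weather was cold",
    ],
)

if __name__ == "__main__":
    choice, chain = answer_with_chain(demo, toy_extractor, toy_answerer)
    print("chain:", " -> ".join(chain))
    print("selected option:", demo.options[choice])
```

Because the two stages communicate only through the text chain in this sketch, either stage can be swapped or inspected independently, which is the property the modular design is meant to provide.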

Paper Structure

This paper contains 24 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: (Top) Concept. (1) Existing Video ($\mathcal{V}$) Question ($\mathcal{Q}$) Answer ($\mathcal{A}$) approaches through the lens of structural causal models (SCMs), highlighting their monolithic and black-box nature. (2) In contrast, we propose a principled departure from this paradigm: leveraging the Causal Reasoning Trace ($\mathcal{C}$), a structured intermediate representation based on natural language causal chains. We factorize this SCM into two SCMs (3, 4), enabling structured video understanding, reasoning, and inference, and leading to superior explainability and performance. (Bottom) Example. Please zoom in for the best view.
  • Figure 2: Causal chain construction for SFT. (1) Human annotators of base datasets intuitively and implicitly make use of causal chains when writing correct answers. (2) We propose to recover these causal chains with the help of an LLM, using the questions and the correct gold answers. (3, 4) Our robust pipeline for causal chain generation, manual verification, and video grounding checks.
  • Figure 3: Stage-wise training of our model.
  • Figure 4: Causal chain ablation study. (1) Study-I: QA accuracy drops by 73% when chains are perturbed. (2) Study-II: the drop in QA accuracy correlates with the amount of perturbation. (3, 4) Qualitative example. OCC: original causal chains; MCC: masked chains; SA: selected answer; PCC: perturbed chains. SA changes intuitively as chains are perturbed. Please zoom in.
  • Figure 5: Qualitative results. GT AO: Ground-truth Answer Option; BM: Baseline Model; SA: Selected Answer; CC: Causal Chain. Only a few frames per video are shown. Green and red boxes indicate success and failure cases. In the first example, actor masks come from the CausalVidQA dataset, which includes reference-based QA. Please zoom in.
  • ...and 1 more figure