ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering
Paritosh Parmar, Eric Peh, Basura Fernando
TL;DR
This work tackles Causal-Why VideoQA with a modular two-stage framework that decouples video understanding from causal inference via natural language causal chains. A Causal Chain Extractor (CCE) generates a causal chain from a video conditioned on the question, and a Causal Chain-Driven Answerer (CCDA) grounds the answer choices in that chain; the two stages are trained stage-wise to preserve causal fidelity. To enable supervised learning, the authors construct a large, human-verified causal-chain dataset (46,024 samples) and introduce CauCo, a causal-coherence metric for evaluating chain quality. Experiments on three datasets, supplemented by human studies, show strong performance gains, improved explainability and user trust, and good out-of-domain generalization, suggesting the approach can serve as a reusable causal-reasoning engine for diverse video understanding tasks.
Abstract
Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular paradigm that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that derives answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating accurate causal chains from existing datasets, and we construct human-verified causal chains for 46K samples. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models but also yields substantial gains in explainability, user trust, and generalization, positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/
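
The two-stage design described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `CausalChain`, `answer_causal_why`, `toy_extractor`, and `toy_answerer` are hypothetical, and the toy stand-ins use a fixed chain and simple word overlap in place of trained models. The sketch also assumes that the answerer receives only the question, the chain, and the answer options; the abstract does not state whether the actual CCDA also receives video features.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class CausalChain:
    """Ordered cause -> effect events expressed in natural language."""
    events: Tuple[str, ...]

    def as_text(self) -> str:
        return " -> ".join(self.events)


# Stage 1 (CCE-like): (video, question) -> causal chain.
ChainExtractor = Callable[[object, str], CausalChain]
# Stage 2 (CCDA-like): (question, chain, options) -> index of chosen option.
ChainAnswerer = Callable[[str, CausalChain, Sequence[str]], int]


def answer_causal_why(
    video: object,
    question: str,
    options: Sequence[str],
    extract: ChainExtractor,
    answer: ChainAnswerer,
) -> Tuple[int, CausalChain]:
    """Run both stages; return the chosen option index and the chain.

    The chain is returned alongside the answer so that it can be
    inspected as an explanation of the prediction.
    """
    chain = extract(video, question)          # stage 1 uses the video
    choice = answer(question, chain, options)  # stage 2 uses only text
    return choice, chain


def _words(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))


def toy_extractor(video: object, question: str) -> CausalChain:
    # Hypothetical stand-in: a trained extractor would read the video.
    return CausalChain(("the ball rolled into the street",
                        "the boy ran after it"))


def toy_answerer(question: str, chain: CausalChain,
                 options: Sequence[str]) -> int:
    # Hypothetical stand-in: pick the option sharing the most words
    # with the chain text (ties resolve to the earliest option).
    chain_words = _words(chain.as_text())
    return max(range(len(options)),
               key=lambda i: len(_words(options[i]) & chain_words))


options = ["The boy was hungry", "The ball rolled into the street"]
choice, chain = answer_causal_why(
    video=None,  # placeholder; the toy extractor ignores it
    question="Why did the boy run?",
    options=options,
    extract=toy_extractor,
    answer=toy_answerer,
)
print(options[choice])
print(chain.as_text())
```

Because the two stages communicate only through the chain, either stand-in can be replaced independently, which is the property the abstract relies on when it describes the CCE as reusable.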
