
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes

Paritosh Parmar, Eric Peh, Ruirui Chen, Ting En Lam, Yuhan Chen, Elston Tan, Basura Fernando

TL;DR

This work addresses the need for deeper causal reasoning in video question answering by introducing CausalChaos!, a challenging dataset built from the Tom & Jerry cartoon corpus. It emphasizes long, well-defined causal chains and provides multi-level explanations to accompany answers, with both multiple-choice (MCQA) and open-ended QA tasks and hard negative mining to prevent shortcuts. Across extensive experiments, state-of-the-art baselines struggle to perform causal reasoning, though some gains are achieved by specialized models and by leveraging large language models; results also show that training on this synthetic dataset can transfer benefits to real-world datasets. The dataset highlights dynamic scene linking and animation-informed cues as crucial for resolving complex causal queries, underscoring the need for improved joint vision-language modeling and causal reasoning capabilities in video understanding systems. The authors also release a dedicated causal-confusion test set to further stress-test causal reasoning in VideoQA and propose directions for future work in explicit causal modeling and open-ended answer generation.

Abstract

Causal video question answering (QA) has garnered increasing interest, yet existing datasets often lack depth in causal reasoning. To address this gap, we capitalize on the unique properties of cartoons and construct CausalChaos!, a novel, challenging causal Why-QA dataset built upon the iconic "Tom and Jerry" cartoon series. Cartoons use the principles of animation, which allow animators to create expressive, unambiguous causal relationships between events that form a coherent storyline. Utilizing these properties, along with thought-provoking questions and multi-level answers (an answer and a detailed causal explanation), our questions involve causal chains that interconnect multiple dynamic interactions between characters and visual scenes. These factors require models to resolve more challenging, yet well-defined, causal relationships. We also introduce hard incorrect answer mining, including a causally confusing version that is even more challenging. While models perform well, there is much room for improvement, especially on open-ended answers. We identify more advanced, explicit causal relationship modeling and joint modeling of vision and language as the immediate areas for future efforts to focus on. Along with other complementary datasets, our new challenging dataset will pave the way for these developments in the field.

Paper Structure

This paper contains 25 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: (Left) Examples of causal questions about characters' actions from our CausalChaos! dataset, based on the Tom & Jerry cartoon series. Q: question; A: answer; E: explanation. Please view in Adobe Reader to play the embedded videos for better explanation. (Middle) Illustration of a causal chain and scene changes. Linking multiple clues/cues embedded in different scenes to resolve causal relationships poses a challenge for VideoQA models. (Right) Animators leverage the Principles of Animation to stylize the visuals and motions, disentangling and highlighting the key content of the scene to create well-defined, unambiguous, and effectively communicated cause-and-effect relationships. The interplay of these factors allows models to focus on solving complex, yet well-defined and unambiguous, causal relationships.
  • Figure 2: (a) Types of reasoning demanded by our CausalChaos! dataset. Reasoning types: DR-deductive reasoning; IR-inductive; SR-spatial; CR-causal; CT-critical thinking; ER-emotional; AR-abductive; TR-temporal; None-no reasoning required as per the human subjects. None is undesirable and tends to indicate that questions are less challenging. (b) Comparison among CausalChaos! and existing causal VideoQA datasets. MA-multilevel answers; CCL-causal chain length; NOS-no. of scenes; RS-reasoning spectrum; MGA-multigranular actions. (c) Qualitative comparison between the CausalChaos! and NextQA (Why-QA) datasets. CausalChaos! Answers and Explanations give detailed information regarding the actual cause-and-effect relationships, motivations, and emotions, covering a wide range of reasoning types. Note that we have temporally cropped the NextQA videos to retain only the relevant parts; otherwise, the raw videos are longer, which introduces an unintended temporal-localization problem for models.
  • Figure 3: Grounded in diverse visual information.
  • Figure 4: Traditional temporal modeling vs Dynamic scene linking. Notice the abrupt scene change, which causes disruption in visual flow, resulting in large amplitude and widespread optical flow.
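
The observation in the Figure 4 caption, that an abrupt scene change produces optical flow that is both large in amplitude and spread across the frame, can be illustrated with a small sketch. This is not the authors' dynamic scene linking method; it only shows how the two cues named in the caption might be combined to flag a cut. The function names, the flat list-of-vectors flow representation, and all threshold values are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical representation: one (dx, dy) displacement per pixel,
# flattened into a sequence. Real optical-flow outputs are 2-D arrays.
Flow = Sequence[Tuple[float, float]]


def flow_stats(flow: Flow, large: float) -> Tuple[float, float]:
    """Return (mean magnitude, fraction of pixels with magnitude > large)."""
    if not flow:
        return 0.0, 0.0
    mags = [math.hypot(dx, dy) for dx, dy in flow]
    mean_mag = sum(mags) / len(mags)
    coverage = sum(m > large for m in mags) / len(mags)
    return mean_mag, coverage


def find_scene_cuts(
    flows: Sequence[Flow],
    mean_thresh: float = 8.0,      # assumed threshold for "large amplitude"
    large: float = 5.0,            # assumed per-pixel "large motion" cutoff
    coverage_thresh: float = 0.6,  # assumed threshold for "widespread"
) -> List[int]:
    """Return indices of flow fields that look like abrupt scene changes.

    A field is flagged only when the motion is both large on average and
    present over most of the frame, so fast but localized motion is not
    mistaken for a cut.
    """
    cuts = []
    for i, flow in enumerate(flows):
        mean_mag, coverage = flow_stats(flow, large)
        if mean_mag > mean_thresh and coverage > coverage_thresh:
            cuts.append(i)
    return cuts


# Toy flow fields over a 100-pixel frame.
smooth = [(0.5, 0.0)] * 100                      # slow, uniform motion
local = [(12.0, 0.0)] * 10 + [(0.0, 0.0)] * 90   # fast motion in a small region
cut = [(10.0, -10.0)] * 100                      # large motion everywhere

print(find_scene_cuts([smooth, local, cut]))  # [2]
```

Requiring both conditions is the design choice here: the `local` field contains fast motion but covers only a tenth of the frame, so it is treated as ordinary action within a scene rather than a scene change.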