Finding the Trigger: Causal Abductive Reasoning on Video Events

Thao Minh Le, Vuong Le, Kien Do, Sunil Gupta, Svetha Venkatesh, Truyen Tran

TL;DR

The paper introduces CARVE, the task of identifying trigger events that causally lead to a target event in video sequences, and presents two counterfactual datasets (CARVE and EpicKitchen-AR) to train and evaluate models. It proposes CERN, a Causal Event Relation Network that builds temporal-semantic event graphs and uses a skip-connected, multi-layer message passing scheme to infer trigger events, achieving improvements over strong baselines. Through extensive experiments, the authors demonstrate the necessity of explicit event-relational representations and abductive reasoning for video understanding, while also revealing limitations of current large video models on this new task. The work sets up a new research direction in video causal reasoning with practical implications for surveillance, root-cause analysis, and content management, and provides datasets, a modeling framework, and analysis to guide future development.

Abstract

This paper introduces a new problem, Causal Abductive Reasoning on Video Events (CARVE), which involves identifying causal relationships between events in a video and generating hypotheses about causal chains that account for the occurrence of a target event. To facilitate research in this direction, we create two new benchmark datasets with both synthetic and realistic videos, accompanied by trigger-target labels generated through a novel counterfactual synthesis approach. To explore the challenge of solving CARVE, we present a Causal Event Relation Network (CERN) that examines the relationships between video events in temporal and semantic spaces to efficiently determine the root-cause trigger events. Through extensive experiments, we demonstrate the critical roles of event relational representation learning and interaction modeling in solving video causal reasoning challenges. The introduction of the CARVE task, along with the accompanying datasets and the CERN framework, will advance future research on video causal reasoning and significantly facilitate various applications, including video surveillance, root-cause analysis and movie content management.

Paper Structure (26 sections, 12 equations, 6 figures, 4 tables, 1 algorithm)

Figures (6)

  • Figure 1: Illustration of event chains and trigger-target event pairs in the $\text{CARVE}$ task, with an example from the accompanying dataset. Videos are generated by a 2D physics simulator using predefined visual scenes. Video events, <object_id_1, object_id_2, interaction_type, start_time, end_time>, are defined as interactions between a dynamic object and an object partner within a time interval. Each dynamic object creates a chain of events (3 in this example, marked with corresponding colors). These chains are merged by mutual events, resulting in a graph of events. Explanation and target event pairs are identified by comparing the original video and counterfactual videos. Best viewed in color.
  • Figure 2: Distribution of target and trigger event locations in time in $\text{CARVE}$. Only the first 100 events are shown.
  • Figure 3: Overview of $\text{CERN}$. Given a target event $e^{\text{target}}$ and its premise events, we build a directed graph of events $\mathcal{G}$ based on their temporal distance. Edge feature vectors $r_{ij}$ (arrows) represent multi-aspect relations between events. We use a novel message passing scheme (Msg) to refine events in consideration of their surrounding events. Gray box illustrates how event $e_{4}$ receives information from preceding events $e_{1},e_{3},e_{5}$ to refine its representation. Refined event features are then bound with the target event and eventually mapped into scores for label prediction.
  • Figure 4: Message passing with skip connections along the “message” and “layer” axes. $\mathcal{F}^{\text{msg}}$ and $\mathcal{F}^{\text{layer}}$ are non-linear functions. Illustrated with two neighbors $j$ and $k$ of node $i$.
  • Figure 5: Qualitative examples demonstrating that sequential models struggle to identify the correct trigger events, while $\text{CERN}$ handles them successfully. Sequential models tend to predict events that occur early in time as the trigger events and are often incapable of tracing back events associated with a co-referenced object between the target and trigger event. The graph structure learned by $\text{CERN}$ has the potential to facilitate the propagation of information along chains of events associated with dynamic objects, thereby benefiting learning. Colored arrows indicate the direction of movement of the corresponding objects. Best viewed in color.
  • ...and 1 more figures
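
The event tuple from the Figure 1 caption and the temporal-distance graph from the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Event` class, the `build_event_graph` function, the `max_gap` threshold, and the two edge attributes (`gap`, `shares_object`) are hypothetical names chosen here. The captions only say that edges depend on temporal distance and that edge features capture multi-aspect relations between events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One video event, mirroring the 5-tuple in the Figure 1 caption."""
    object_id_1: int
    object_id_2: int
    interaction_type: str
    start_time: float
    end_time: float


def build_event_graph(events, max_gap):
    """Return directed edges (i, j, attrs) from an earlier event i to a later event j.

    Assumption: two events are connected when j starts after i and within
    `max_gap` time units. The attributes are illustrative stand-ins for the
    edge features r_ij, not the features used in the paper.
    """
    edges = []
    for i, src in enumerate(events):
        for j, dst in enumerate(events):
            if i == j:
                continue
            gap = dst.start_time - src.start_time
            if 0 < gap <= max_gap:
                src_objects = {src.object_id_1, src.object_id_2}
                dst_objects = {dst.object_id_1, dst.object_id_2}
                attrs = {"gap": gap, "shares_object": bool(src_objects & dst_objects)}
                edges.append((i, j, attrs))
    return edges


# Toy events (made up for illustration, not taken from the dataset).
events = [
    Event(1, 2, "collide", 0.0, 0.5),
    Event(2, 3, "collide", 1.0, 1.5),
    Event(4, 5, "collide", 1.2, 1.6),
    Event(3, 6, "collide", 5.0, 5.4),
]
edges = build_event_graph(events, max_gap=2.0)
edge_pairs = [(i, j) for i, j, _ in edges]
```

In this toy input, the last event starts more than `max_gap` after every other event, so it receives no incoming edge. A real system would have to choose the threshold (or a learned alternative) so that long causal chains are not cut.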