CLEVRER-Humans: Describing Physical and Causal Events the Human Way

Jiayuan Mao, Xuelin Yang, Xikun Zhang, Noah D. Goodman, Jiajun Wu

TL;DR

CLEVRER-Humans introduces a human-annotated, language-rich representation of physical events and their causality via Causal Event Graphs (CEGs) built on CLEVRER footage. A three-stage data collection and augmentation pipeline—iterative causal cloze, trajectory-based event generation, and CEG condensation—produces dense, QA-ready data that captures diverse event types and nuanced human judgments. Baseline evaluations reveal that models struggle with the expanded vocabulary and human-like causal judgments, underscoring the need for data-efficient, physics-grounded language understanding. The dataset advances both machine learning and cognitive science by providing a challenging benchmark for grounding natural language in dynamic physical scenes and for studying human causal perception.

Abstract

Building machines that can reason about physical events and their causal relationships is crucial for flexible interaction with the physical world. However, most existing physical and causal reasoning benchmarks are exclusively based on synthetically generated events and synthetic natural language descriptions of causal relationships. This design brings up two issues. First, there is a lack of diversity in both event types and natural language descriptions; second, causal relationships based on manually-defined heuristics are different from human judgments. To address both shortcomings, we present the CLEVRER-Humans benchmark, a video reasoning dataset for causal judgment of physical events with human labels. We employ two techniques to improve data collection efficiency: first, a novel iterative event cloze task to elicit a new representation of events in videos, which we term Causal Event Graphs (CEGs); second, a data augmentation technique based on neural language generative models. We convert the collected CEGs into questions and answers to be consistent with prior work. Finally, we study a collection of baseline approaches for CLEVRER-Humans question-answering, highlighting the great challenges set forth by our benchmark.

Paper Structure (40 sections, 13 figures, 6 tables)

Figures (13)

  • Figure 1: For (a) each video in the CLEVRER dataset, (b) CLEVRER-Humans annotates a human-labeled graphical representation of physical events and their causal relations, in the form of causal event graphs (CEGs). Each CEG is composed of a collection of nodes associated with textual descriptions of events, and human-labeled directional edges indicating the causal relationship between objects. Each edge is also associated with a score indicating a human's graded attribute of causal relations. (c) The compact representation of CEGs can be easily translated into question-answer pairs to evaluate video reasoning models.
  • Figure 2: Two examples showing the difference between human causal judgment and CLEVRER's heuristic causal relations. The arrows in the image represent the moving direction of objects of interest.
  • Figure 3: The overall labeling pipeline of CLEVRER-Humans. (a) Starting from input videos, (b) we use an event cloze task to collect a small number of human-written event descriptions about videos (Stage I). (c) Next, we train neural event description generators to augment all videos with a collection of events (Stage II). (d) Finally, human annotators label the correctness of all generated events (in this case, event E is incorrect and thus the node is dropped) as well as their causal relations (Stage III).
  • Figure 4: We use a novel iterative data collection procedure to collect CEGs on MTurk. Starting from a single node (iteration 0, blue), we iteratively sample nodes in the current CEG and collect either a cause or an effect event, and add this new node to the CEG. Red: nodes added in the first iteration; yellow: nodes added in the second iteration.
  • Figure 5: The neural event description generation pipeline (Stage II). (a) Given the input video, (b) we first extract per-object trajectories, composed of their attributes, positions, velocities, and angular velocities. For each object, we use a single-object event generator to sample event descriptions. For each pair of objects, we use a cascaded generator composed of a rule-based event detector and a neural pairwise generator. All generated events will pass a post-processing unit composed of three stages: grammar check, object existence check, and verb re-balancing. (c) The final product of the pipeline is a candidate node set of the CEG, which will be further annotated by humans in Stage III, the CEG condensation stage.
  • ...and 8 more figures
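
The captions for Figures 1 and 3 describe a CEG as event nodes with text descriptions, directed edges carrying a graded human score, the removal of events judged incorrect, and a translation into question-answer pairs. The Python sketch below is one possible way to picture that structure; it is not the dataset's actual schema or the authors' code. The class name `CausalEventGraph`, the score scale, the `threshold` of 3.0, and the question template are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CausalEventGraph:
    """Hypothetical container for one video's CEG (not the official format)."""

    # node id -> textual event description
    events: dict[str, str] = field(default_factory=dict)
    # (cause id, effect id) -> graded causal score (scale is an assumption)
    edges: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_event(self, node_id: str, description: str) -> None:
        self.events[node_id] = description

    def add_edge(self, cause: str, effect: str, score: float) -> None:
        # Edges may only connect events that are already in the graph.
        if cause not in self.events or effect not in self.events:
            raise KeyError("both endpoints must be existing event nodes")
        self.edges[(cause, effect)] = score

    def drop_event(self, node_id: str) -> None:
        """Remove an event judged incorrect, together with its incident edges."""
        self.events.pop(node_id, None)
        self.edges = {k: v for k, v in self.edges.items() if node_id not in k}

    def to_qa_pairs(self, threshold: float = 3.0) -> list[tuple[str, bool]]:
        """Turn every labeled edge into a yes/no question (assumed template)."""
        pairs = []
        for (cause, effect), score in sorted(self.edges.items()):
            question = (
                f'Is "{self.events[cause]}" responsible for '
                f'"{self.events[effect]}"?'
            )
            pairs.append((question, score >= threshold))
        return pairs


# Made-up example events; none of this text comes from the dataset.
ceg = CausalEventGraph()
ceg.add_event("A", "the red cube hits the blue sphere")
ceg.add_event("B", "the blue sphere rolls toward the cylinder")
ceg.add_event("E", "the green cube spins")
ceg.add_edge("A", "B", 4.5)
ceg.add_edge("E", "B", 1.0)
ceg.drop_event("E")  # mirrors the Figure 3 case where event E is incorrect
qa_pairs = ceg.to_qa_pairs()
```

Storing edges in a dictionary keyed by `(cause, effect)` keeps the graded score next to the direction of the relation, which makes the conversion to binary answers a single threshold comparison. A real pipeline would likely aggregate scores from several annotators before thresholding.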