CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions

Tayfun Ates, M. Samil Atesoglu, Cagatay Yigit, Ilker Kesen, Mert Kobas, Erkut Erdem, Aykut Erdem, Tilbe Goksun, Deniz Yuret

TL;DR

CRAFT introduces a challenging video QA benchmark to evaluate causal reasoning about forces and interactions in synthetic 2D physics scenes. It combines Descriptive, Counterfactual, and Force Dynamics-inspired Causal questions (cause/enable/prevent) generated via functional programs with Box2D simulations, including initial/final states and causal graphs. The study shows humans outperform even strong multimodal models, revealing substantial gaps in current methods for dynamic physical and causal reasoning, especially under unseen scene layouts. It also provides a broad set of baselines and analysis, highlighting directions toward neuro-symbolic and object-centric reasoning to improve causal understanding in video QA. The dataset and results underscore the need for more sophisticated reasoning over physical interactions in AI systems.

Abstract

Humans are able to perceive, understand, and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step in this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two of the question categories in CRAFT are the previously studied descriptive and counterfactual questions. Additionally, inspired by the Force Dynamics Theory in cognitive linguistics, we introduce a new causal question category that involves understanding the causal interactions between objects through notions like cause, enable, and prevent. Our results show that even though the questions in CRAFT are easy for humans, the tested baseline models, including existing state-of-the-art methods, cannot yet handle the challenges posed in our benchmark.

Paper Structure

This paper contains 15 sections, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Example CRAFT questions generated for a sample scene. There are 48 different tasks divided into three distinct categories for 20 different scenes. Besides having tasks questioning descriptive properties, possibly needing temporal reasoning, CRAFT introduces challenges including more complex tasks requiring single or multiple counterfactual analysis or understanding object intentions for deep causal reasoning.
  • Figure 2: Random configurations of static scene element properties for each scene. The opaque regions show the mean value for that element, whereas the overlaid regions show the extreme values. Although these changes may seem subtle, they provide a wide variety in terms of scene dynamics.
  • Figure 3: A simple causal graph. The causal graph is a graphical summary of the events that occur in a simulation. For the sake of simplicity, here we only include the interactions between the dynamic objects and the basket; moreover, the scene is simple enough that there is no intermediate branching in the causal graph.
  • Figure 4: Distribution of question types and answers in CRAFT. The innermost layer represents the distribution of the questions for different task categories. The middle layer illustrates the distribution of the answer types for each task category. The outermost layer represents the distribution of answers for each answer type.
  • Figure A.1: Example programs for descriptive questions.
  • ...and 6 more figures
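
The causal graph described in the Figure 3 caption, a summary of the events that occur in a simulation, can be pictured as a small directed graph of events. The sketch below is only an illustration under stated assumptions: the `Event` and `CausalGraph` classes, their field names, the event kinds, and the example objects and step numbers are all hypothetical and are not taken from the CRAFT dataset format. It builds an unbranched chain of events, like the simple case the caption mentions, and checks whether one event can lead to another.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    # Hypothetical event record; field names are illustrative, not CRAFT's schema.
    id: int
    kind: str        # assumed labels, e.g. "start", "collision", "enter_basket", "end"
    objects: tuple   # names of the objects taking part in the event
    step: int        # simulation step at which the event occurs


@dataclass
class CausalGraph:
    events: dict = field(default_factory=dict)    # event id -> Event
    children: dict = field(default_factory=dict)  # event id -> list of successor ids

    def add_event(self, event):
        self.events[event.id] = event
        self.children.setdefault(event.id, [])

    def add_edge(self, src, dst):
        self.children[src].append(dst)

    def leads_to(self, src, dst):
        """Return True if dst is reachable from src along directed edges."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.children[node])
        return False


def build_example():
    # A made-up chain with no intermediate branching.
    g = CausalGraph()
    g.add_event(Event(0, "start", (), 0))
    g.add_event(Event(1, "collision", ("ball", "cube"), 40))
    g.add_event(Event(2, "enter_basket", ("cube", "basket"), 95))
    g.add_event(Event(3, "end", (), 200))
    for src, dst in [(0, 1), (1, 2), (2, 3)]:
        g.add_edge(src, dst)
    return g
```

With this representation, a question about whether an earlier interaction contributed to a later outcome reduces to a reachability check such as `build_example().leads_to(1, 2)`. How CRAFT itself derives cause, enable, and prevent answers from its graphs is not specified in this overview, so the sketch does not attempt it.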