Compositional Physical Reasoning of Objects and Events from Videos

Zhenfang Chen; Shilong Dong; Kexin Yi; Yunzhu Li; Mingyu Ding; Antonio Torralba; Joshua B. Tenenbaum; Chuang Gan

TL;DR

The paper tackles the challenge of inferring hidden intrinsic physical properties, such as mass and charge, from a limited number of videos and using them to predict object dynamics. It introduces ComPhy, a dataset with synthetic and real-world videos, and the Physical Concept Reasoner (PCR), a neuro-symbolic framework with modules for perception, property grounding, hidden-property inference, dynamics prediction, and differentiable symbolic execution. Trained with curriculum learning and learning by imagination, PCR performs strongly on factual, predictive, and counterfactual reasoning tasks and shows advantages over state-of-the-art baselines and large vision-language models (LVLMs), while highlighting gaps in generalization to more complex scenes. The work also explores integrating PCR with LVLMs to enhance language parsing and commonsense reasoning, and provides a real-world dataset to assess generalization beyond simulation, underscoring the importance of hidden physical properties for robust physical scene understanding.

Abstract

Understanding and reasoning about objects' physical properties in the natural world is a fundamental challenge in artificial intelligence. While some properties like colors and shapes can be directly observed, others, such as mass and electric charge, are hidden from the objects' visual appearance. This paper addresses the unique challenge of inferring these hidden physical properties from objects' motion and interactions and predicting corresponding dynamics based on the inferred physical properties. We first introduce the Compositional Physical Reasoning (ComPhy) dataset. For a given set of objects, ComPhy includes limited videos of them moving and interacting under different initial conditions. The model is evaluated based on its capability to unravel the compositional hidden properties, such as mass and charge, and use this knowledge to answer a set of questions. Besides the synthetic videos from simulators, we also collect a real-world dataset to further test the physical reasoning abilities of different models. We evaluate state-of-the-art video reasoning models on ComPhy and reveal their limited ability to capture these hidden properties, which leads to inferior performance. We also propose a novel neuro-symbolic framework, Physical Concept Reasoner (PCR), that learns and reasons about both visible and hidden physical properties from question answering. After training, PCR demonstrates remarkable capabilities. It can detect and associate objects across frames, ground visible and hidden physical properties, make future and counterfactual predictions, and utilize these extracted representations to answer challenging questions.

Paper Structure (33 sections, 3 equations, 23 figures, 13 tables)

Introduction
Related Work
Dataset
Videos
Questions
Balancing and Statistics
Real-World Datasets
Experiments
Baselines
Evaluation on physical reasoning
Models
Model
Video Perceiver
Visible property grounder
Physical Property Inferencer
...and 18 more sections

Figures (23)

Figure 1: Non-visual properties like mass and charge govern the interaction between objects and lead to different motion trajectories. a) Objects attract and repel each other according to the (sign of) charge they carry. b) Mass determines how much an object's trajectory is perturbed during an interaction. Heavier objects have more stable motion.
Figure 2: Sample target video, reference videos and question-answer pairs from ComPhy.
Figure 3: Samples of real data. For extensive experiments, we collect real objects with different masses and magnetism, properties that have a significant effect on the objects' motion and interaction.
Figure 4: The perception module detects objects' location and visual appearance attributes. The physical property learner learns objects' properties based on detected object trajectories. The dynamic predictor predicts objects' dynamics in the counterfactual scene based on objects' properties and locations. Finally, an execution engine runs the program parsed by the language parser on the predicted dynamic scene to answer the question.
Figure 5: A qualitative example of PCR on ComPhy. The upper-left blue box shows the original video and a counterfactual question to answer. The table on the right shows the executable program sequence parsed from the question, with the concepts related to it and the outputs after execution. The lower-left chart illustrates how PCR executes the program "counterfact charge": 1. PCR uses the Physical Property Inferencer (PPI) to parse the factual charge properties of objects in the scene; 2. PCR modifies their properties according to the counterfactual concept and predicts new dynamics using a dynamic predictor.
...and 18 more figures
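
The captions of Figures 1 and 5 describe two ideas that a small simulation may make concrete: hidden properties (the sign of charge, and mass) shape trajectories, and a counterfactual question can be answered by editing one property and re-predicting the dynamics. The sketch below is only a toy illustration under stated assumptions. It uses a hand-written 1-D Coulomb-like force, not the simulator behind ComPhy and not PCR's learned dynamic predictor, and every name in it (`Body`, `simulate`, `counterfactual_charge`, the constants `k`, `dt`, `steps`) is hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Body:
    x: float       # 1-D position
    v: float       # velocity
    mass: float    # hidden property: how strongly a force perturbs the motion
    charge: float  # hidden property: the sign decides attraction vs. repulsion


def simulate(a, b, steps=100, dt=0.01, k=1.0):
    """Advance two bodies under a Coulomb-like pairwise force (toy model)."""
    for _ in range(steps):
        r = b.x - a.x
        # Force on b along +x; it points away from a when the charges share a sign.
        f_on_b = k * a.charge * b.charge * r / abs(r) ** 3
        va = a.v - f_on_b / a.mass * dt  # equal and opposite force on a
        vb = b.v + f_on_b / b.mass * dt
        a = replace(a, v=va, x=a.x + va * dt)
        b = replace(b, v=vb, x=b.x + vb * dt)
    return a, b


def counterfactual_charge(body, new_charge):
    """Edit one hidden property, leaving everything else as observed."""
    return replace(body, charge=new_charge)


light = Body(x=0.0, v=0.0, mass=1.0, charge=+1.0)
heavy = Body(x=2.0, v=0.0, mass=5.0, charge=+1.0)

# Factual scene: same-sign charges, so the gap between the bodies grows.
fa, fb = simulate(light, heavy)
gap_factual = fb.x - fa.x

# Counterfactual scene: "what if the heavy object carried the opposite charge?"
ca, cb = simulate(light, counterfactual_charge(heavy, -1.0))
gap_counterfactual = cb.x - ca.x

# The heavier body is displaced less by the same interaction.
shift_light = abs(fa.x - light.x)
shift_heavy = abs(fb.x - heavy.x)
```

In this toy setting, `gap_factual` ends above the initial separation of 2.0 and `gap_counterfactual` ends below it, while `shift_heavy` is smaller than `shift_light`, matching the qualitative behavior the Figure 1 caption describes. The edit-then-resimulate pattern mirrors the two steps listed for "counterfact charge" in Figure 5, with the analytic force standing in for a learned predictor.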
