ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

Zhicheng Zheng, Xin Yan, Zhenfang Chen, Jingzhou Wang, Qin Zhi Eddie Lim, Joshua B. Tenenbaum, Chuang Gan

TL;DR

ContPhy introduces a Continuum Physical Dataset to probe machine physical commonsense across soft bodies, fluids, and their interactions with rigid and articulated objects. It pairs a Unity-based physics generation pipeline with a designed question engine to produce diverse property and dynamics questions, and evaluates a spectrum of baselines (blind, visual, physical, MLLMs) alongside ContPRO, a neural–symbolic oracle that uses LLMs as program parsers and particle-based dynamics as the simulator. Across scenarios, current models show substantial gaps versus humans, especially for soft materials and fluids, while ContPRO achieves the strongest performance and even surpasses human accuracy on some predictive tasks. The work underscores the value of neural–symbolic approaches for integrating perception, simulation, and language reasoning to advance physical commonsense in continuum settings, and outlines directions for expanding language diversity and scene complexity in future benchmarks.

Abstract

We introduce the Continuum Physical Dataset (ContPhy), a novel benchmark for assessing machine physical commonsense. ContPhy complements existing physical reasoning benchmarks by covering the inference of diverse physical properties, such as mass and density, across various scenarios and the prediction of the corresponding dynamics. We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy, which shows that current AI models still lack physical commonsense for the continuum, especially soft bodies, and illustrates the value of the proposed dataset. We also introduce an oracle model (ContPRO) that marries particle-based physical dynamics models with recent large language models and enjoys the advantages of both: precise dynamic predictions and interpretable reasoning. ContPhy aims to spur progress in perception and reasoning within diverse physical settings, narrowing the divide between human and machine intelligence in understanding the physical world. Project page: https://physical-reasoning-project.github.io

Paper Structure (64 sections, 8 figures, 14 tables)


Figures (8)

  • Figure 1: The motivation is derived from a range of everyday soft materials and their interactions with rigid objects, whose physical behaviors or functions vary with their diverse physical properties. a) Gasoline flows more readily than glue due to its lower viscosity, while oil, with its lower density, tends to float above water. b) Poplin and canvas exhibit surface wrinkles of varying granularity due to their distinct bending compliance. c) The lifting approach requires less force due to the redistributed tensile forces facilitated by the movable pulley. d) The trajectories of a tennis ball and a dough ball demonstrate their differing elasticity and plasticity.
  • Figure 2: The figure presents samples from the four puzzle blocks of our Continuum Physical Dataset (ContPhy). ContPhy offers rendered outputs from the simulation of randomly sampled scenarios, accompanied by their respective question-answer pairs. These pairs span from understanding soft-body physical properties, concepts, and interactions with rigid objects through comparative analysis, to temporal and spatial dynamic predictions, counterfactual considerations, and goal-oriented problem-solving. The dataset aims to provide a comprehensive resource for AI models to interpret the physical world of various deformable bodies. Note that the samples are best viewed as videos.
  • Figure 3: The architecture of the ContPRO model. Given questions, predefined APIs, and specific prompts, an LLM acts as a program parser that translates questions into code snippets. The visual perception module predicts objects' locations and static attributes. The physical simulation module predicts dynamics. The symbolic execution module executes the code snippet to output the answer.
  • Figure 4: Sensor data outputs are multimodal, depicting the 4D states of objects across various levels, ranging from object-level, point-level to event-level.
  • Figure 5: Question distribution of fluid, rope, cloth, and ball scenarios.
  • ...and 3 more figures
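
The parse-then-execute flow described in the Figure 3 caption can be illustrated with a toy sketch. This is not the authors' implementation: the names `SceneObject`, `parse_question`, `filter_color`, `query_density`, `greater`, and `execute` are hypothetical, the LLM parser is replaced by a fixed lookup table, the perception output is hand-written, and the physical simulation module is left out entirely. It only shows how a question can be mapped to a small program over a predefined API and then run symbolically against perceived object attributes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    """Hypothetical stand-in for one object reported by a perception module."""
    name: str
    color: str
    density: float  # assumed to be estimated upstream; hand-written here


def parse_question(question: str) -> str:
    """Stand-in for the LLM program parser.

    A real system would prompt an LLM with the API description and the
    question; this sketch uses a fixed lookup so it runs offline.
    """
    programs = {
        "Is the blue fluid denser than the red fluid?":
            "greater(query_density(filter_color(objects, 'blue')), "
            "query_density(filter_color(objects, 'red')))",
    }
    return programs[question]


# --- Hypothetical predefined API exposed to the generated programs ---

def filter_color(objects: List[SceneObject], color: str) -> SceneObject:
    """Return the unique object with the given color."""
    matches = [o for o in objects if o.color == color]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one {color} object, got {len(matches)}")
    return matches[0]


def query_density(obj: SceneObject) -> float:
    """Read the density attribute of an object."""
    return obj.density


def greater(a: float, b: float) -> str:
    """Compare two values and answer in natural language."""
    return "yes" if a > b else "no"


def execute(program: str, objects: List[SceneObject]) -> str:
    """Symbolic execution: evaluate the program against the API only."""
    api = {
        "objects": objects,
        "filter_color": filter_color,
        "query_density": query_density,
        "greater": greater,
    }
    # Builtins are emptied so the program can only call the API above.
    return eval(program, {"__builtins__": {}}, api)


# Hand-written perception output for a two-fluid scene (illustrative values).
scene = [
    SceneObject(name="fluid_a", color="blue", density=1.2),
    SceneObject(name="fluid_b", color="red", density=0.8),
]
question = "Is the blue fluid denser than the red fluid?"
answer = execute(parse_question(question), scene)
print(answer)
```

Keeping the program as a composition of small API calls is what makes the reasoning inspectable: each intermediate value (the selected object, its density) can be logged or checked independently of the final answer.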