IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

Florian Bordes, Quentin Garrido, Justine T Kao, Adina Williams, Michael Rabbat, Emmanuel Dupoux

TL;DR

IntPhys 2 advances intuitive physics benchmarking by introducing a photorealistic, occlusion-focused video dataset that tests four core principles (Permanence, Immutability, Spatio-Temporal Continuity, Solidity) using the violation-of-expectation framework. The authors evaluate state-of-the-art multimodal models and predictive methods against human performance, revealing a substantial gap: humans perform near perfectly, while current AI systems largely operate at chance, even on the easier subsets. Through human, MLLM, and prediction-based evaluations, the work identifies memory, context length, and prompting as critical bottlenecks, and shows that increasing realism and occlusion complexity makes the task harder. The study concludes that significant innovations in architectures and training methods are needed to approach human-like intuitive physics understanding in complex, dynamic environments.

Abstract

We present IntPhys 2, a video benchmark designed to evaluate the intuitive physics understanding of deep learning models. Building on the original IntPhys benchmark, IntPhys 2 focuses on four core principles related to macroscopic objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity. These conditions are inspired by research into intuitive physical understanding emerging during early childhood. IntPhys 2 offers a comprehensive suite of tests, based on the violation of expectation framework, that challenge models to differentiate between possible and impossible events within controlled and diverse virtual environments. Alongside the benchmark, we provide performance evaluations of several state-of-the-art models. Our findings indicate that while these models demonstrate basic visual understanding, they face significant challenges in grasping intuitive physics across the four principles in complex scenes, with most models performing at chance levels (50%), in stark contrast to human performance, which achieves near-perfect accuracy. This underscores the gap between current models and human-like intuitive physics understanding, highlighting the need for advancements in model architectures and training methodologies.

Paper Structure

This paper contains 24 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Example of a scene in IntPhys2, which follows a similar design to IntPhys1. Each scene consists of a set of four videos. Two videos depict possible outcomes, while the other two depict impossible outcomes. The presence of an obstacle or occluder determines the outcome: a possible outcome in the first pair becomes impossible in the second, and vice versa. In this example, a silver ball rolls down a path. If a brick obstacle is present, the ball should collide with it and change its trajectory. If the ball passes through the brick obstacle without altering its path, this outcome is deemed impossible. Conversely, when no obstacle is present, the ball's trajectory should remain unchanged, making this the possible outcome.
  • Figure 2: Benchmark Comparison. Our analysis compares four benchmarks: GRASP [jassim_grasp_2024], InfLevel [weihs2022benchmarking], IntPhys [riochet2018intphys], and IntPhys2 (Ours). The benchmarks differ in their number of videos (simulated, real, test, development, debug, main, held-out), experimental setups, and physical properties assessed (permanence, immutability, continuity, solidity, inertia, gravity, collision, support). The density plots illustrate the distribution of occlusion durations (in seconds) for each benchmark. In contrast to other benchmarks, IntPhys2 covers a wider range of occlusion durations, allowing for a better assessment of a model's short-term memory. Camera settings vary between static and moving configurations. Example frames from each benchmark are shown on the right.
  • Figure 3: Evaluation of model sensitivity. (left) We conducted an ablation study examining sensitivity to different prompts, the model's variability in responses to identical inputs, and the difficulty level of the data. (right) We illustrate how a model's performance varies with the number of frames it receives. Our findings indicate that most models struggle to make effective use of an increased number of input frames.
  • Figure 4: Results for predictive models. (left) When measuring whether models exhibit a higher surprise for impossible instances within a pair, we find that all tested models perform around chance level. (middle) This translates to the harder setting of single video classification, where the performance remains around chance. (right) Focusing on camera movements, one of the key changes in IntPhys 2, we find that models also struggle across camera settings. Confidence intervals are obtained via bootstrapping.
  • Figure 5: Example of different tasks and environments that are in the IntPhys2 benchmark.
  • ...and 2 more figures
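
The pairwise evaluation described in the Figure 4 caption (does a model assign higher surprise to the impossible video of a pair, with bootstrapped confidence intervals) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the tie-handling rule, the percentile bootstrap, and the toy surprise values are all assumptions made for the example.

```python
import random


def pairwise_accuracy(pairs):
    """Fraction of pairs where the impossible video is more surprising.

    `pairs` is a list of (surprise_possible, surprise_impossible) tuples.
    Ties count as half-correct (an assumption), so a model that cannot
    discriminate at all scores 0.5, i.e. chance level.
    """
    score = 0.0
    for possible, impossible in pairs:
        if impossible > possible:
            score += 1.0
        elif impossible == possible:
            score += 0.5
    return score / len(pairs)


def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for pairwise accuracy."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        # Resample pairs with replacement and recompute the metric.
        pairwise_accuracy([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical surprise scores, for illustration only.
toy_pairs = [(0.1, 0.9), (0.5, 0.2), (0.3, 0.3), (0.2, 0.8)]
accuracy = pairwise_accuracy(toy_pairs)
ci_low, ci_high = bootstrap_ci(toy_pairs)
print(f"pairwise accuracy = {accuracy:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With so few pairs the interval is very wide; on a benchmark-sized set of pairs, an interval that still contains 0.5 is what "around chance level" means in this setting.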