Intuitive physics understanding emerges from self-supervised pretraining on natural videos

Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun

TL;DR

This work finds that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency, in contrast to video prediction in pixel space and multimodal large language models, which reason through text.

Abstract

We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.

Paper Structure

This paper contains 14 sections, 4 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Video prediction in representation space (V-JEPA) achieves an understanding of intuitive physics. (A) Video models are evaluated on three intuitive physics datasets using the Violation of Expectation paradigm (IntPhys, GRASP, and InfLevel). V-JEPA is significantly more 'surprised' by implausible videos. Random initializations of V-JEPA (untrained networks) show near-chance performance, and state-of-the-art video models based on text or pixel prediction are much closer to chance. 95% confidence intervals are obtained via bootstrapping, except for untrained networks ($n=20$), which use a normal distribution assumption. (B) V-JEPA is trained to 'inpaint' natural videos in a learned representation space. Starting from a video and a corrupted version of it, representations are first extracted. The goal is then to predict the representation of the original video from the representation of the corrupted one. (C) From a trained V-JEPA, we compute a surprise metric by predicting representations of N future frames based on M past ones and comparing the predictions to the representations of observed events. The surprise metric is then used to decide which of the two videos contains a physical violation.
  • Figure 2: V-JEPA accuracy increase relative to randomly initialized models and humans across different physical properties and benchmarks. (A) Because some benchmarks contain low-level biases, we test the model performance against a set of randomly initialized networks ($n=20$). V-JEPA models ($n=5$) have higher relative classification accuracy on intuitive physics benchmarks for most, but not all, concepts. (B) V-JEPA relative (left) and absolute (right) accuracy on the IntPhys test set across different conditions compared to naive human performance, showing a high correlation between human and machine errors. The V-JEPA score uses the maximum surprise from each video, which generalizes better for single-video classification. Human data are taken from Riochet et al. (2022).
  • Figure 3: Influence of type of mask, type and amount of training data, and model size on V-JEPA IntPhys scores. (A) When pretrained on VM2M, V-JEPA exhibits an understanding of intuitive physics with every masking strategy. (B) Of the three training datasets, two give high accuracies when trained separately (K710 and HowTo100M). High scores are found with only 1289 hours of HowTo100M (the largest dataset), and even 128 hours gives better-than-chance performance. (C) While larger encoders improve performance, we find that the performance remains non-trivial across sizes when pretraining on HowTo100M. Confidence intervals are obtained via bootstrapping.
  • Figure S1: Different surprise measures are better suited for different tasks. Focusing on IntPhys, we find that looking at the average surprise over a video leads to better performance when comparing pairs of videos (left). A one-sample t-test was performed to see if the relative surprises are greater than zero. However, when looking at individual videos' surprise, choosing the maximum surprise over a video leads to a better separation between possible and impossible videos (right). A two-sample t-test was performed to see if impossible videos have higher surprise than possible ones.
  • Figure S2: Normalized probabilities output by Qwen2-VL-72B. When presented with a pair of videos, we find that the model outputs similar probabilities for possible and impossible videos.
  • ...and 9 more figures
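
The surprise metric described in the Figure 1(C) caption, with the mean and maximum aggregations compared in Figure S1, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the mean-absolute-error distance, the window sizes, and the constant-velocity `toy_predictor` (a stand-in for a trained predictor) are all assumptions made for illustration, and the "representations" are synthetic arrays rather than V-JEPA features.

```python
import numpy as np


def window_surprises(features, predict_fn, m_context, n_future):
    """Score prediction error over sliding windows of a (T, D) feature array.

    For each window, n_future representations are predicted from m_context
    past ones and compared with the observed representations. The mean
    absolute error used here is an assumed distance, not a confirmed choice.
    """
    n_windows = features.shape[0] - m_context - n_future + 1
    if n_windows <= 0:
        raise ValueError("video is too short for the requested window sizes")
    scores = []
    for start in range(n_windows):
        split = start + m_context
        context = features[start:split]
        observed = features[split:split + n_future]
        predicted = predict_fn(context, n_future)
        scores.append(np.abs(predicted - observed).mean())
    return np.asarray(scores)


def video_surprise(features, predict_fn, m_context, n_future, reduce="mean"):
    """Aggregate window scores into one number per video ("mean" or "max")."""
    if reduce not in ("mean", "max"):
        raise ValueError("reduce must be 'mean' or 'max'")
    scores = window_surprises(features, predict_fn, m_context, n_future)
    return float(scores.max() if reduce == "max" else scores.mean())


def pick_implausible(surprise_a, surprise_b):
    """Return the index (0 or 1) of the video with the higher surprise."""
    return 0 if surprise_a > surprise_b else 1


def toy_predictor(context, n_future):
    """Hypothetical stand-in predictor: constant-velocity extrapolation."""
    velocity = context[-1] - context[-2]
    steps = np.arange(1, n_future + 1)[:, None]
    return context[-1] + steps * velocity


# Synthetic "representations": smooth linear motion in a 4-dimensional space.
plausible = 0.5 * np.arange(12, dtype=float)[:, None] * np.ones((1, 4))

# Same trajectory with an abrupt jump at frame 6, mimicking a violation.
implausible = plausible.copy()
implausible[6:] += 5.0

s_plausible = video_surprise(plausible, toy_predictor, m_context=3, n_future=2)
s_implausible = video_surprise(implausible, toy_predictor, m_context=3, n_future=2)
s_implausible_max = video_surprise(
    implausible, toy_predictor, m_context=3, n_future=2, reduce="max"
)
flagged = pick_implausible(s_plausible, s_implausible)
```

In this toy setup the smooth trajectory is extrapolated exactly, so its surprise is zero, while the jump raises the surprise of the second video and `flagged` points at it. Following Figure S1, the mean aggregation is the one used when comparing a pair of videos, and the maximum when scoring a single video on its own.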