The Power of Next-Frame Prediction for Learning Physical Laws

Thomas Winterbottom, G. Thomas Hudson, Daniel Kluvanec, Dean Slack, Jamie Sterling, Junjie Shentu, Chenghao Xiao, Zheming Zhou, Noura Al Moubayed

TL;DR

Problem: whether visual models can acquire an understanding of physical laws through unsupervised predictive tasks rather than explicit labelling. Approach: train two architectures (FCN CNN and Patch Transformer) on next-frame prediction across six diagnostic dynamic datasets and evaluate emergent physics through linear probing of frozen features. Contributions: six probing datasets, two model families, and a two-step protocol showing significant improvement over baselines in estimating physical constants like gravity after generative pretraining. Significance: demonstrates the inductive power of generative pretraining for learning physical reasoning in vision and highlights avenues for scaling pretraining to richer visual domains.

Abstract

Next-frame prediction is a powerful method for modelling and understanding the dynamics of video data. Inspired by the empirical success of causal language modelling and next-token prediction, we explore the extent to which next-frame prediction serves as a strong foundational learning strategy (analogous to language modelling) for inducing an understanding of the visual world. To quantify the specific visual understanding induced by next-frame prediction, we introduce six diagnostic simulation video datasets derived from fundamental physical laws and created by varying physical constants such as gravity and mass. We demonstrate that our models trained only on next-frame prediction are capable of predicting the value of these physical constants (e.g. gravity) without having been trained directly to learn these constants via a regression task. We find that the generative training phase alone induces a model state that can predict physical constants significantly better than a randomly initialised model, improving the loss by a factor of between 1.28 and 6.24. We conclude that next-frame prediction shows great promise as a general learning strategy to induce understanding of the many 'laws' that govern the visual domain without the need for explicit labelling.
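The probing step summarised above (a linear probe fitted on frozen features, compared against a randomly initialised baseline) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy "gravity" values, the two hand-made feature sets standing in for frozen latents, the gradient-descent probe, and the reading of the improvement factor as baseline loss divided by pretrained loss. It is not the authors' code and uses none of their data.

```python
import random


def predict(x, w, b):
    """Linear probe output for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def mse(features, targets, w, b):
    """Mean squared error of the probe over a dataset."""
    errors = [(predict(x, w, b) - y) ** 2 for x, y in zip(features, targets)]
    return sum(errors) / len(errors)


def fit_linear_probe(features, targets, lr=0.1, epochs=500):
    """Fit w and b by full-batch gradient descent on the MSE.

    The features are plain numbers here, so nothing upstream of the probe
    is updated. That is the sense in which the backbone is 'frozen'.
    """
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, targets):
            err = predict(x, w, b) - y
            for j in range(d):
                grad_w[j] += 2.0 * err * x[j] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gj for wi, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


rng = random.Random(0)

# Toy physical constant per video, scaled to [0, 1] (hypothetical values).
gravity = [rng.random() for _ in range(200)]

# Hypothetical stand-ins for frozen latent features:
# 'informative' mimics a model whose latents encode the constant,
# 'uninformative' mimics latents that carry no information about it.
informative = [[g + rng.gauss(0, 0.05), rng.gauss(0, 0.05)] for g in gravity]
uninformative = [[rng.gauss(0, 0.05), rng.gauss(0, 0.05)] for _ in gravity]

# For brevity the loss is measured on the same data the probe was fitted on.
loss_informative = mse(informative, gravity, *fit_linear_probe(informative, gravity))
loss_uninformative = mse(uninformative, gravity, *fit_linear_probe(uninformative, gravity))

# Assumed definition: how many times smaller the probing loss becomes.
improvement_factor = loss_uninformative / loss_informative
print(f"improvement factor: {improvement_factor:.2f}")
```

Because the probe is only a linear map, any gap between the two losses has to come from what the features already encode, which is the reason a linear probe is a reasonable diagnostic here. A real evaluation would use held-out videos rather than the fitting data.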

Paper Structure (20 sections, 1 equation, 7 figures, 1 table)

This paper contains 20 sections, 1 equation, 7 figures, 1 table.

Figures (7)

  • Figure 1: Example visualisations of 6 sequential frames from each of our six proposed dynamic simulation video datasets. From top to bottom: Pendulum, Roller coaster with flight, Mars Moon, Colliding Blocks, 2D Bouncing balls, and 3D bouncing balls.
  • Figure 2: The two steps of training in our experiments. Step 1 trains the model to predict the next frame of a given frame sequence. Step 2 takes the frozen weights of a model trained in Step 1 (or a randomly initialised model), extracts its latent representations through linear probes, and performs a regression task on the underlying physical constant directly.
  • Figure 3: Fully Convolutional 2D CNN model. Each convolution unit is made from two convolution layers, i.e. an initial convolution layer that changes the input resolution, followed by another of kernel size 1$\times$1 that does not. The arrows labelled 'Probe' indicate which points in the network are extracted to form linear probes used in Section \ref{subsec_probe}.
  • Figure 4: The Patch Transformer model, adapted from the SegFormer \cite{xie2021segformer} for video generation. The arrows labelled 'Probe' indicate which points in the network are extracted to form linear probes used in Section \ref{subsec_probe}.
  • Figure 5: Metrics calculated between the ground truth and the predicted frame on each modelling dataset. Higher is better for PSNR and SSIM, and lower is better for L1.
  • ...and 2 more figures
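
Two of the frame-quality metrics named in the Figure 5 caption, L1 and PSNR, are simple enough to sketch directly. The block below is a minimal illustration under stated assumptions: frames are single-channel nested lists with pixel values in [0, 1], L1 is taken as the mean absolute pixel difference, and the function names are hypothetical. SSIM is left out because it needs windowed statistics that would not fit a short example.

```python
import math


def _pixel_pairs(pred, target):
    """Yield (predicted, ground-truth) pixel pairs from two equally sized frames."""
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            yield p, t


def l1_error(pred, target):
    """Mean absolute pixel difference (lower is better)."""
    diffs = [abs(p - t) for p, t in _pixel_pairs(pred, target)]
    return sum(diffs) / len(diffs)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB (higher is better).

    max_value is the largest possible pixel value; 1.0 is an assumption
    that matches frames normalised to [0, 1].
    """
    squared = [(p - t) ** 2 for p, t in _pixel_pairs(pred, target)]
    mean_squared_error = sum(squared) / len(squared)
    if mean_squared_error == 0.0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mean_squared_error)


# Toy 2x2 frames (hypothetical values): every pixel is off by 0.1.
target_frame = [[0.5, 0.5], [0.5, 0.5]]
predicted_frame = [[0.4, 0.6], [0.4, 0.6]]

print(l1_error(predicted_frame, target_frame))
print(psnr(predicted_frame, target_frame))
```

The two metrics move in opposite directions, which is why the caption states the preferred direction for each: a better prediction lowers L1 and raises PSNR.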