The Power of Next-Frame Prediction for Learning Physical Laws
Thomas Winterbottom, G. Thomas Hudson, Daniel Kluvanec, Dean Slack, Jamie Sterling, Junjie Shentu, Chenghao Xiao, Zheming Zhou, Noura Al Moubayed
TL;DR
Problem: whether visual models can acquire an understanding of physical laws through unsupervised predictive tasks rather than explicit labelling. Approach: train two architectures (FCN CNN and Patch Transformer) on next-frame prediction across six diagnostic dynamic datasets, then evaluate emergent physics by linear probing of the frozen features. Contributions: six probing datasets, two model families, and a two-step protocol showing that, after generative pretraining, the loss when estimating physical constants such as gravity improves by a factor of 1.28 to 6.24 over a random model. Significance: demonstrates the inductive power of generative pretraining for learning physical reasoning in vision and highlights avenues for scaling pretraining to richer visual domains.
Abstract
Next-frame prediction is a useful and powerful method for modelling and understanding the dynamics of video data. Inspired by the empirical success of causal language modelling and next-token prediction, we explore the extent to which next-frame prediction serves as a strong foundational learning strategy (analogous to language modelling) for inducing an understanding of the visual world. To quantify the specific visual understanding induced by next-frame prediction, we introduce six diagnostic simulation video datasets derived from fundamental physical laws, each created by varying physical constants such as gravity and mass. We demonstrate that models trained only on next-frame prediction can predict the values of these physical constants (e.g. gravity) without having been trained directly on them via a regression task. We find that the generative training phase alone induces a model state that predicts physical constants significantly better than a random model, improving the loss by a factor of between 1.28 and 6.24. We conclude that next-frame prediction shows great promise as a general learning strategy for inducing an understanding of the many 'laws' that govern the visual domain without the need for explicit labelling.
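The probing step and the reported improvement factor can be illustrated with a minimal sketch. It assumes the probe is a ridge-regularised linear regression from frozen features to a scalar constant, scored by mean squared error, and that the improvement factor is the ratio of the random model's probe loss to the pretrained model's probe loss. The function names, the regulariser, and the choice of loss are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a linear probe on frozen features (closed-form ridge regression).

    features: (n_samples, n_dims) array extracted from a frozen encoder.
    targets:  (n_samples,) array of a physical constant, e.g. gravity.
    The l2 value is an assumed default, not a setting from the paper.
    """
    X = _with_bias(features)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)


def probe_mse(weights, features, targets):
    """Mean squared error of the probe's predictions."""
    predictions = _with_bias(features) @ weights
    return float(np.mean((predictions - targets) ** 2))


def improvement_factor(loss_random, loss_pretrained):
    """Assumed definition: how many times lower the pretrained probe loss is."""
    return loss_random / loss_pretrained


# Usage on synthetic stand-in data (not the paper's datasets or features).
rng = np.random.default_rng(0)
frozen_features = rng.normal(size=(200, 4))
gravity = frozen_features @ np.array([1.0, 2.0, 3.0, 4.0]) + 0.5
probe = fit_linear_probe(frozen_features, gravity)
print("probe MSE:", probe_mse(probe, frozen_features, gravity))
```

Under this assumed definition, a factor of 6.24 would mean the probe on pretrained features reaches a loss 6.24 times lower than the same probe on features from a random model. In practice the probe would be fitted on one split and scored on a held-out split, with the encoder weights left untouched throughout.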
