Inferring Dynamic Physical Properties from Video Foundation Models
Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman
TL;DR
This work investigates inferring dynamic physical properties from video using PhysVid, a dataset of synthetic and real-world sequences covering elasticity, viscosity, and dynamic friction. It evaluates three inference pathways: oracle estimation, readouts from frozen video foundation models (with a learnable visual prompt), and prompting strategies for multimodal large language models (MLLMs). Generative and self-supervised backbones achieve strong synthetic performance and reasonable real-world generalization, while MLLMs require careful prompting. The results show a consistent gap to the oracle, especially for absolute value prediction; domain adaptation and simple cues such as a red circle help bridge the sim-to-real gap. Overall, the findings underscore both the potential and the current limits of video foundation models in concrete physical reasoning, and point to improvements in representation learning and task-specific prompting.
Abstract
We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: the elasticity of a bouncing object, the viscosity of a flowing liquid, and the dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real-world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple readout mechanism using a visual prompt and a trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompting strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve similar performance, though behind that of the oracle, and that MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.
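
To make the readout idea in (b) concrete, the following is a minimal sketch of a single-query cross-attention readout over frozen backbone tokens. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy token values, the single attention head, and the linear regression head are all hypothetical, and a real setup would learn `query` and `head_weights` by gradient descent while the backbone stays frozen.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_readout(query, tokens, head_weights, head_bias=0.0):
    """Pool frozen backbone tokens with one query vector, then regress a scalar.

    query:        trainable prompt vector (hypothetical), length d
    tokens:       frozen video features, one length-d vector per token
    head_weights: trainable linear head (hypothetical), length d
    """
    dim = len(query)
    # Scaled dot-product score between the query and every token.
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
        for tok in tokens
    ]
    weights = softmax(scores)
    # Attention-weighted average of the token features.
    pooled = [
        sum(w * tok[j] for w, tok in zip(weights, tokens))
        for j in range(dim)
    ]
    # Linear head maps the pooled feature to one physical-property value.
    prediction = sum(hw * p for hw, p in zip(head_weights, pooled)) + head_bias
    return prediction, weights

# Toy inputs (assumed values, not from the dataset).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]          # a zero query attends uniformly to all tokens
head_weights = [1.0, 1.0]
prediction, weights = attention_readout(query, tokens, head_weights)
```

With a zero query the attention weights are uniform, so the readout reduces to mean pooling followed by the linear head; training the query is what would let the readout focus on the tokens that carry the dynamic cue.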
