Inferring Dynamic Physical Properties from Video Foundation Models

Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman

TL;DR

This work investigates inferring dynamic physical properties from video using PhysVid, a dataset of synthetic and real-world sequences covering elasticity, viscosity, and dynamic friction. It evaluates three inference pathways: oracle estimation, readouts from frozen video foundation models (with a learnable visual prompt), and prompting strategies for multimodal large language models (MLLMs). Generative and self-supervised backbones achieve strong synthetic performance and reasonable real-world generalization, while MLLMs require careful prompting. The results show a consistent gap to the oracle, especially for absolute value prediction; domain adaptation and simple cues such as a red circle help bridge the sim-to-real gap. Overall, the findings underscore the potential and current limits of video foundation models in concrete physical reasoning, guiding future improvements in representation learning and task-specific prompting.

Abstract

We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real-world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple readout mechanism using a visual prompt and a trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve similar performance, though behind that of the oracle, and that MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.

Paper Structure

This paper contains 28 sections, 5 equations, 23 figures, 7 tables.

Figures (23)

  • Figure 1: Examples of the PhysVid dataset. Each row shows a different property, and each column shows three frames from video samples in the synthetic sets (train, test-1, and test-2) and the real test-3 set. The train and test-1 sets are from the same distribution. In test-2, parameters such as lighting, viewpoint, and color differ from those in test-1.
  • Figure 2: Oracle methods for physical properties. The objective in each case is to extract a measurement from the sequence that can directly be used to predict the property. For elasticity, we extract the centroid trajectory from segmentation masks, and then normalize the $y$-coordinates to the range $0$ to $1$; the ratio of bouncing height to dropping height over the sequence indicates the elasticity. For viscosity, we calculate the liquid's area in the image via segmentation masks, and then normalize the areas by the area in the frame when the liquid first touches the ground; the slope of the normalized area sequence reflects the viscosity. For friction, we transform to a bird's eye view (using a homography transformation based on 4 corner points of the top surface of the sliding object), and fit a parabola $x = \alpha t^2 + \beta t + c$ to the transformed trajectory; the parabola coefficient $\alpha$ predicts the friction coefficient. For each video, we show the segmentation for two frames (left $\rightarrow$ right).
  • Figure 3: Architectures for dynamic physical property prediction. Left: video generative model as backbone; Middle: video self-supervised model as backbone; Right: multimodal large language model (MLLM). For the pre-trained video diffusion model (U-Net, left) and the pre-trained self-supervised model (ViT, middle), the representations are kept frozen, and a 'visual prompt' learns to infer the physical properties. For the MLLMs, the physical properties are inferred using a language prompt (right).
  • Figure 4: Qualitative results. Top Left: An example for elasticity absolute value prediction; Bottom Left: An example for friction relative value comparison. For each example, the original input video is shown on the left. A static red circle is overlaid in the center to highlight the full trajectory of the object on every frame, shown in the middle. Model predictions are shown on the right, including results from the Video Generative Model (VGM), Video Self-Supervised Model (VSM), and an MLLM (Gemini in this case). For the relative formulation, the ground truth value of '$1$' indicates that the first (top) video has a larger dynamic friction coefficient than the second video. In this example, the initial velocity of the lego brick in the two videos is similar (note the same displacement from frame $0$ to $2$), but the velocity reduces to $0$ at frame $30$ in the first video, while the object is still moving from frame $30$ to $60$ in the second video. Right: Scatter plots of prediction vs ground truth for the elasticity property from the V-JEPA-2 model.
  • Figure 5: Objects and surfaces in the friction real dataset. Top: Objects used for friction real dataset collection; Bottom: Surfaces used for friction real dataset collection.
  • ...and 18 more figures
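
The three oracle measurements described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`bounce_ratio`, `normalized_area_slope`, `parabola_alpha`, `friction_from_alpha`) are hypothetical, and the inputs are assumed to be already extracted per frame: centroid heights with "up" positive (image $y$ flipped), mask areas, the index of the first ground-contact frame, and bird's-eye positions after the homography. The final conversion from $\alpha$ to a friction coefficient is an extra assumption that is not stated in the source: for a horizontal surface with Coulomb friction and metric coordinates, $\ddot{x} = 2\alpha = -\mu g$.

```python
import numpy as np


def bounce_ratio(heights):
    """Ratio of first rebound height to drop height.

    `heights` holds one height per frame, with "up" positive. Assumes a
    smooth trajectory that starts near the drop height and bounces at least
    once; the first local minimum is treated as ground contact.
    """
    h = np.asarray(heights, dtype=float)
    d = np.diff(h)
    rising = np.flatnonzero(d > 0)
    if rising.size == 0:
        raise ValueError("no rebound found in trajectory")
    contact = int(rising[0])                      # first frame before height increases
    falling = np.flatnonzero(d[contact:] <= 0)    # first frame where the rise stops
    peak = contact + int(falling[0]) if falling.size else h.size - 1
    ground = h[contact]
    drop = h[: contact + 1].max()
    if drop <= ground:
        raise ValueError("trajectory does not start above the ground")
    return float((h[peak] - ground) / (drop - ground))


def normalized_area_slope(areas, contact_frame):
    """Slope (per frame) of mask area normalized by the area at ground contact.

    `contact_frame` is assumed to be known, e.g. from the segmentation masks.
    """
    a = np.asarray(areas, dtype=float)
    norm = a[contact_frame:] / a[contact_frame]
    frames = np.arange(norm.size)
    slope, _intercept = np.polyfit(frames, norm, deg=1)
    return float(slope)


def parabola_alpha(t, x):
    """Least-squares fit of x = alpha*t^2 + beta*t + c; returns alpha."""
    alpha, _beta, _c = np.polyfit(t, x, deg=2)
    return float(alpha)


def friction_from_alpha(alpha, g=9.81):
    """Assumption (not from the source): metric x, horizontal surface,
    Coulomb friction, so the deceleration 2*alpha equals -mu*g."""
    return -2.0 * alpha / g
```

Under ideal conditions (no air drag), the rebound-to-drop height ratio equals the square of the coefficient of restitution, which is why it serves as a direct cue for elasticity. In pixel units, $\alpha$ is only proportional to the deceleration, so without a known metric scale it is better treated as a feature to regress from than as a physical friction coefficient.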