PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

Yu-Wei Zhan, Xin Wang, Hong Chen, Tongtong Feng, Wei Feng, Ren Wang, Guangyao Li, Qing Li, Wenwu Zhu

TL;DR

PhyVLLM tackles the lack of physical dynamics modeling in Video LLMs by separating motion from appearance and modeling continuous dynamics with Neural ODEs. The framework projects physics priors into a frozen LLM via lightweight adapters and trains with a self-supervised physics-consistency objective, avoiding explicit physical labels. Evaluations on PhyBench and general video benchmarks show substantial gains in physical reasoning and robustness with a compact fine-tuning approach. These results highlight the practical impact of incorporating explicit physical modeling into large-scale video-language systems.

Abstract

Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also the ability to capture physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable representations of physical dynamics. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM is trained in a self-supervised manner to model the continuous evolution of object motion. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.
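
As a rough illustration of the continuous-time modeling described above, the following Python sketch integrates a latent motion state through time with a fixed-step Runge-Kutta solver. It is not the authors' implementation: the function names (`dynamics`, `rk4_step`, `rollout`), the single tanh layer standing in for the learned vector field, and the choice of solver are all assumptions made for this example.

```python
import math

def dynamics(z, t, W, b):
    """Hypothetical vector field f(z, t): one tanh layer (assumption, not from the paper).

    t is unused here (autonomous system) but kept to match the usual ODE signature.
    """
    return [
        math.tanh(sum(w * zj for w, zj in zip(row, z)) + bi)
        for row, bi in zip(W, b)
    ]

def rk4_step(f, z, t, dt):
    """One classical fourth-order Runge-Kutta step for dz/dt = f(z, t)."""
    k1 = f(z, t)
    k2 = f([zi + 0.5 * dt * k for zi, k in zip(z, k1)], t + 0.5 * dt)
    k3 = f([zi + 0.5 * dt * k for zi, k in zip(z, k2)], t + 0.5 * dt)
    k4 = f([zi + dt * k for zi, k in zip(z, k3)], t + dt)
    return [
        zi + dt / 6.0 * (a + 2.0 * b_ + 2.0 * c + d)
        for zi, a, b_, c, d in zip(z, k1, k2, k3, k4)
    ]

def rollout(z0, times, W, b, substeps=10):
    """Integrate from z0 and return the state at every requested time point."""
    f = lambda z, t: dynamics(z, t, W, b)
    z = list(z0)
    states = [list(z)]
    for t0, t1 in zip(times, times[1:]):
        dt = (t1 - t0) / substeps
        t = t0
        for _ in range(substeps):
            z = rk4_step(f, z, t, dt)
            t += dt
        states.append(list(z))
    return states

# Toy usage: a 2-D motion state observed at three time points.
W = [[0.0, 1.0], [-1.0, 0.0]]   # illustrative weights, not learned
b = [0.0, 0.0]
trajectory = rollout([1.0, 0.0], [0.0, 0.5, 1.0], W, b)
```

Because the state is defined at any time between observations, a model of this kind can be queried at arbitrary time points rather than only at sampled frames. In a trained system the weights would be learned and a differentiable solver would be used so that gradients flow through the rollout.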

Paper Structure

This paper contains 27 sections, 13 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Example of acceleration vs. deceleration recognition. (a) Humans determine whether an object is accelerating or decelerating by reasoning over physical attributes, such as the sign of acceleration. (b) A Video LLM fails to distinguish the two cases due to a lack of motion modeling. (c) Our method explicitly models dynamic motion using a Neural ODE and successfully infers the correct physical state.
  • Figure 2: Overview of the proposed PhyVLLM framework. It consists of three main components: (a) Motion-Appearance Disentanglement, which separates appearance cues and dynamic motion patterns via a dual-branch encoder; (b) Physical-Guided Motion Modeling and Prediction, where a Neural ODE module continuously models object dynamics from the motion features; and (c) Token Projection, which maps the disentangled features into the token space of a pretrained LLM using lightweight adapters, enabling seamless integration while preserving compatibility with the frozen backbone.
  • Figure 3: Similarity heatmap between predicted motion features (T9'–T11') and ground-truth motion features (T0–T11). Darker colors indicate higher similarity.
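
The kind of comparison shown in Figure 3 can be sketched as a pairwise similarity matrix between predicted and ground-truth feature vectors. The caption does not state which similarity measure is used; cosine similarity is assumed here, and the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(predicted, ground_truth):
    """Rows index predicted features, columns index ground-truth features."""
    return [[cosine(p, g) for g in ground_truth] for p in predicted]

# Toy usage: two predicted vectors compared against three ground-truth vectors.
predicted = [[1.0, 0.0], [0.0, 1.0]]
ground_truth = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
heatmap = similarity_matrix(predicted, ground_truth)
```

In a matrix like this, a well-predicted feature for time step t would have its largest value in the column of the ground-truth feature for the same step.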