TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina
TL;DR
There is currently no robust way to quantitatively evaluate physical realism in video. This work addresses that gap with TRAVL, a trajectory-aware fine-tuning recipe that injects motion-grounded attention into pretrained video-language models (VLMs) without changing their backbones. TRAVL combines intra-frame spatial attention with trajectory-guided temporal attention driven by sparse patch trajectories from CoTracker, and pairs this module with a balanced fine-tuning dataset covering plausible and implausible dynamics. To assess physical reasoning rigorously, the authors introduce ImplausiBench, a 300-video benchmark (150 real, 150 generated) whose adversarial blind testing minimizes linguistic shortcuts and isolates visual-temporal understanding. In the reported experiments, TRAVL improves implausibility detection on both backbones tested (Video-ChatGPT and LLaVA-NeXT) under human and LLM-judge evaluation, though broad generalization and calibration on real versus generated content remain open challenges. Together, TRAVL offers a lightweight, extensible way to strengthen physical grounding in VLMs, and ImplausiBench offers a diagnostic for grounded visual-temporal reasoning.
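To make the two attention patterns concrete, the following is a minimal sketch of how intra-frame spatial attention and trajectory-guided temporal attention could be expressed as boolean masks over patch tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names (`token_index`, `spatial_mask`, `trajectory_mask`, `masked_softmax`), the flat token layout, and the hand-written tracks are all hypothetical. In practice the tracks would come from a point tracker such as CoTracker, mapped to patch indices.

```python
import math
from typing import List, Optional, Sequence


def token_index(frame: int, patch: int, patches_per_frame: int) -> int:
    """Flatten a (frame, patch) pair into one token index (assumed layout)."""
    return frame * patches_per_frame + patch


def spatial_mask(num_frames: int, patches_per_frame: int) -> List[List[bool]]:
    """Intra-frame attention: a token may attend only to tokens of its own frame."""
    n = num_frames * patches_per_frame
    return [
        [(i // patches_per_frame) == (j // patches_per_frame) for j in range(n)]
        for i in range(n)
    ]


def trajectory_mask(
    tracks: Sequence[Sequence[Optional[int]]],
    num_frames: int,
    patches_per_frame: int,
) -> List[List[bool]]:
    """Temporal attention: tokens on the same track may attend to each other.

    tracks[k][t] is the patch index that track k occupies in frame t,
    or None when the tracked point is not visible in that frame.
    """
    n = num_frames * patches_per_frame
    # Every token attends to itself, so no softmax row is ever empty.
    mask = [[i == j for j in range(n)] for i in range(n)]
    for track in tracks:
        tokens = [
            token_index(t, p, patches_per_frame)
            for t, p in enumerate(track)
            if p is not None
        ]
        for a in tokens:
            for b in tokens:
                mask[a][b] = True
    return mask


def masked_softmax(scores: Sequence[float], allowed: Sequence[bool]) -> List[float]:
    """Softmax over one row of attention scores, ignoring disallowed positions."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        raise ValueError("at least one position must be allowed")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 3 frames, 4 patches per frame, two hand-written tracks.
tracks = [[0, 1, 2], [3, None, 3]]
s_mask = spatial_mask(num_frames=3, patches_per_frame=4)
t_mask = trajectory_mask(tracks, num_frames=3, patches_per_frame=4)
```

In this toy layout, the first track links token 0 (frame 0, patch 0) to token 5 (frame 1, patch 1) and token 10 (frame 2, patch 2), so temporal attention follows the moving patch instead of a fixed grid location. Because the masks only constrain which tokens interact, a module built this way can sit alongside a pretrained backbone without altering it, which is the property the recipe relies on.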
Abstract
Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
