TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina

TL;DR

This work addresses the lack of a quantitative measure of physical realism in video by introducing TRAVL, a trajectory-aware fine-tuning recipe that injects motion-grounded attention into pretrained video-language models (VLMs) without changing their backbones. TRAVL combines intra-frame spatial attention with temporal attention guided by sparse patch trajectories from CoTracker, and is paired with a balanced fine-tuning dataset covering plausible and implausible dynamics. To assess physical reasoning rigorously, the authors introduce ImplausiBench, a 300-video benchmark with adversarial blind testing that minimizes linguistic shortcuts and isolates visual-temporal understanding. Experiments show that TRAVL improves implausibility detection across two backbones (Video-ChatGPT and LLaVA-NeXT) under both human and LLM-judge evaluations, though challenges remain in broad generalization and in calibration on real versus generated content. Overall, TRAVL provides a lightweight, extensible way to improve physical grounding in VLMs, and ImplausiBench offers a diagnostic for grounded visual-temporal reasoning about physical plausibility.
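To make the combination of intra-frame spatial attention and trajectory-guided temporal attention more concrete, here is a minimal sketch of one way such an attention mask could be built. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_trajectory_mask`, the frame-major token layout, and the representation of a trajectory as one patch index per frame (`None` when the tracked point is not visible) are all hypothetical. In the paper the trajectories come from CoTracker; here a single trajectory is written by hand.

```python
from typing import List, Optional

# Hypothetical representation: one patch index per frame, None if not visible.
Trajectory = List[Optional[int]]


def build_trajectory_mask(
    num_frames: int,
    patches_per_frame: int,
    trajectories: List[Trajectory],
) -> List[List[bool]]:
    """Return an N x N boolean mask, N = num_frames * patches_per_frame.

    mask[i][j] is True when token i is allowed to attend to token j.
    """
    n = num_frames * patches_per_frame
    mask = [[False] * n for _ in range(n)]

    def token(frame: int, patch: int) -> int:
        # Assumed frame-major layout: all patches of frame 0, then frame 1, ...
        return frame * patches_per_frame + patch

    # Spatial part: every token attends to all tokens of its own frame.
    for f in range(num_frames):
        for p in range(patches_per_frame):
            for q in range(patches_per_frame):
                mask[token(f, p)][token(f, q)] = True

    # Temporal part: tokens visited by the same trajectory attend to
    # each other across frames.
    for traj in trajectories:
        if len(traj) != num_frames:
            raise ValueError("each trajectory needs one entry per frame")
        visited = [token(f, p) for f, p in enumerate(traj) if p is not None]
        for i in visited:
            for j in visited:
                mask[i][j] = True

    return mask


# Toy example: 3 frames of 4 patches, one point moving through patches 0 -> 1 -> 3.
mask = build_trajectory_mask(3, 4, [[0, 1, 3]])
```

In this toy setting, token 0 (frame 0, patch 0) may attend to the rest of frame 0 and to tokens 5 and 11 on its trajectory, but not to token 4 (frame 1, patch 0), which lies in another frame and off the trajectory. A mask of this kind would typically be passed to a masked self-attention layer so that disallowed pairs receive zero attention weight.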

Abstract

Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.

Paper Structure

This paper contains 62 sections, 2 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Video Language Models (VLMs) often struggle with fine-grained understanding of physics realism. We propose a fine-tuning recipe that helps VLMs become better judges of physics implausibility.
  • Figure 2: Overview of our proposed TRAVL framework. Given input video frames, we apply a vision encoder followed by trajectory-aware masked self-attention, which integrates spatial and temporal context using patch trajectories tracked by CoTracker. The enriched features are projected into the language model's embedding space. Only the trajectory attention and vision-to-language projector are fine-tuned; the vision encoder and language model are kept frozen.
  • Figure 3: Fine-tuning data pipeline. Our dataset is built in three stages: Stage 1 (Plausible Captioning): GPT-4o generates initial captions for real (plausible) videos, verified by human reviewers. Stage 2 (Feedback-Augmented Captioning): Human annotators provide short temporal feedback for each implausible video, which is combined with the original real caption to create a complete description using GPT-4o. Stage 3 (QA Generation): Based on the final caption, GPT-4o produces temporally grounded question-answer pairs per video. This pipeline enables fine-grained supervision across a controlled set of plausible and implausible variants.
  • Figure 4: Example from ImplausiBench. For each scenario, we include both a real (plausible) and a generated (implausible) video that share the same initial scene and visual style. Each pair is annotated with a shared multiple-choice question containing three plausible, three implausible, and one "None of the above" option. The correct answer depends on which version of the video is shown—ensuring that models must ground their predictions in visual-temporal evidence rather than language alone.
  • Figure 5: Qualitative examples from TRAVL. The first two pages show frames from Impossible Videos, while the remaining pages illustrate plausible and implausible variants from ImplausiBench. These examples were selected by manual inspection to showcase representative successes (check mark) and failures (cross) across different models.
  • ...and 6 more figures
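
The paired design described in the Figure 4 caption can be sketched as a small data structure plus a scoring routine. This is an illustrative sketch only: the class `PairedQuestion`, its field names, the function `score_pair`, and the toy question and options are assumptions made for this example, not the benchmark's actual schema or data. It shows why a model that answers from the question text alone cannot be correct on both videos of a pair, provided the two correct answers differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PairedQuestion:
    """One shared multiple-choice question for a real/generated video pair."""
    question: str
    options: List[str]      # 3 plausible + 3 implausible + "None of the above"
    answer_real: int        # index of the correct option for the real video
    answer_generated: int   # index of the correct option for the generated video


# Hypothetical model interface: (question, options, video_kind) -> option index.
Model = Callable[[str, List[str], str], int]


def score_pair(item: PairedQuestion, model: Model) -> Dict[str, bool]:
    """Ask the same question about both videos and check each answer."""
    pick_real = model(item.question, item.options, "real")
    pick_generated = model(item.question, item.options, "generated")
    return {
        "real": pick_real == item.answer_real,
        "generated": pick_generated == item.answer_generated,
    }


# Toy item, invented for illustration.
item = PairedQuestion(
    question="What happens to the ball after it is released?",
    options=[
        "It rolls down the ramp",        # plausible
        "It bounces once and stops",     # plausible
        "It falls to the floor",         # plausible
        "It floats upward",              # implausible
        "It passes through the table",   # implausible
        "It changes into a cube",        # implausible
        "None of the above",
    ],
    answer_real=0,
    answer_generated=3,
)


def language_only_model(question: str, options: List[str], video_kind: str) -> int:
    # Ignores the video entirely and always picks the first option.
    return 0


result = score_pair(item, language_only_model)
```

Here `result` marks the real video as correct and the generated video as incorrect: because the language-only model ignores the video, it returns the same option for both and can match at most one of the two differing answers.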