From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation

Fan Yang, Zhiyang Chen, Yousong Zhu, Xin Li, Jinqiao Wang

TL;DR

The paper tackles physically inconsistent motion in current video generation models by proposing TrajVLM-Gen, a two-stage framework that couples trajectory reasoning with controllable video synthesis. In the first stage, a Vision-Language Model predicts coarse, physically plausible trajectories, represented as sequences $[p_1,\dots,p_n]$ of bounding boxes $p=(x_1,y_1,x_2,y_2)$, optionally using chain-of-thought reasoning to capture physical dynamics. In the second stage, a trajectory-aware OpenSora-based generator enforces motion alignment through cross-attention constrained by an energy term $E(A_t)=M_{\text{traj}}\,\text{sign}(A_t) - \nabla^2 A_t$, with trajectories converted to text and fed into the query. A 1.3M-trajectory dataset is constructed from public visual tracking sources, including gravity/elastic/perspective labels for physical guidance, with public release planned. Experiments on UCF-101 and MSR-VTT report FVD scores of 545 and 539, respectively, along with improved trajectory-prediction metrics, which the authors take as validation of physics-aware, trajectory-guided video generation for reliable, controllable video synthesis.
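To make the trajectory representation concrete, the following minimal sketch serializes a sequence of bounding boxes into a text string of the kind that could be fed to a generator. The function names `trajectory_to_text` and `box_centers`, the bracketed output format, and the use of normalized coordinates in $[0, 1]$ are all assumptions for illustration; the summary states only that trajectories are converted to text, not how.

```python
from typing import List, Tuple

# A box p = (x1, y1, x2, y2): top-left and bottom-right corners.
# Normalized [0, 1] coordinates are an assumption, not stated in the source.
Box = Tuple[float, float, float, float]


def trajectory_to_text(boxes: List[Box], precision: int = 2) -> str:
    """Serialize a box sequence [p_1, ..., p_n] into a plain-text string.

    The bracketed format is hypothetical and chosen only for readability.
    """
    parts = []
    for x1, y1, x2, y2 in boxes:
        if x2 < x1 or y2 < y1:
            raise ValueError(f"malformed box: {(x1, y1, x2, y2)}")
        coords = ", ".join(f"{v:.{precision}f}" for v in (x1, y1, x2, y2))
        parts.append(f"[{coords}]")
    return "[" + ", ".join(parts) + "]"


def box_centers(boxes: List[Box]) -> List[Tuple[float, float]]:
    """Return the center point of each box, i.e. the coarse motion path."""
    return [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]


# Two steps of an object moving straight down the frame.
falling = [(0.40, 0.10, 0.50, 0.20), (0.40, 0.30, 0.50, 0.40)]
print(trajectory_to_text(falling))
# prints: [[0.40, 0.10, 0.50, 0.20], [0.40, 0.30, 0.50, 0.40]]
```

Fixing the decimal precision keeps the serialized string short and gives every box the same textual width, which is one plausible way to keep such strings easy for a language model to read and emit.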

Abstract

Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
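As an illustration of how box trajectories might feed an attention-based mechanism, the sketch below rasterizes each predicted box into a binary mask on a coarse grid, one mask per frame. This is one plausible reading of a trajectory mask such as $M_{\text{traj}}$; the helper names `box_to_mask` and `trajectory_masks`, the grid size, and the cell-center inclusion rule are assumptions and are not taken from the paper.

```python
from typing import List, Tuple

# Box in normalized [0, 1] coordinates (an assumption): (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]
Mask = List[List[int]]


def box_to_mask(box: Box, height: int, width: int) -> Mask:
    """Rasterize one box into a height x width binary mask.

    A cell is set to 1 when its center lies inside the box.
    """
    x1, y1, x2, y2 = box
    mask = []
    for row in range(height):
        cy = (row + 0.5) / height  # normalized y of the cell center
        mask.append([
            1 if (x1 <= (col + 0.5) / width <= x2 and y1 <= cy <= y2) else 0
            for col in range(width)
        ])
    return mask


def trajectory_masks(boxes: List[Box], height: int, width: int) -> List[Mask]:
    """Build one mask per frame from a box sequence [p_1, ..., p_n]."""
    return [box_to_mask(box, height, width) for box in boxes]


# A box covering the top-left quadrant of a 4 x 4 attention grid.
for row in box_to_mask((0.0, 0.0, 0.5, 0.5), height=4, width=4):
    print(row)
# prints:
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [0, 0, 0, 0]
# [0, 0, 0, 0]
```

A mask of this kind could then be used to weight or constrain attention maps at the matching spatial resolution; how TrajVLM-Gen actually does this is not specified in the text above.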

Paper Structure

This paper contains 10 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: TrajVLM-Gen allows humans to query an initial video frame with questions, generating textual answers that include both reasoning processes and trajectory predictions. Predicted trajectories are visualized as bounding boxes, where red marks the starting point and blue denotes the ending positions along the path. Guided by these trajectories, the system synthesizes controllable video content that aligns with the specified motion.
  • Figure 2: Our overall framework, TrajVLM-Gen, enables trajectory reasoning and controllable video generation.
  • Figure 3: Based on our predicted trajectories, we can generate videos with reasonable motion that follows physical laws.