From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation

Fan Yang; Zhiyang Chen; Yousong Zhu; Xin Li; Jinqiao Wang

TL;DR

The paper addresses physically inconsistent motion in current video generation models by proposing TrajVLM-Gen, a two-stage framework that couples trajectory reasoning with controllable video synthesis. In the first stage, a Vision-Language Model predicts coarse, physics-consistent trajectories, each represented as a sequence $[p_1,\dots,p_n]$ of bounding boxes $p=(x_1,y_1,x_2,y_2)$, optionally using chain-of-thought reasoning to capture physical dynamics. In the second stage, a trajectory-aware, OpenSora-based generator enforces motion alignment through cross-attention constrained by an energy term $E(A_t)=M_{\text{traj}}\,\text{sign}(A_t) - \nabla^2 A_t$, with trajectories converted to text and fed into the query. A dataset of 1.3M trajectories is constructed from public visual tracking sources and annotated with gravity, elastic, and perspective labels for physical guidance; a public release is planned. Experiments on UCF-101 and MSR-VTT report competitive FVD scores (545 and 539, respectively) and improved trajectory-prediction metrics, supporting physics-aware, trajectory-guided video generation as a route to more reliable, controllable video synthesis.
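To make the trajectory representation concrete, the following is a minimal Python sketch of a bounding-box sequence $[p_1,\dots,p_n]$ serialized to text and rasterized into a binary region mask in the spirit of $M_{\text{traj}}$. The function names (`trajectory_to_text`, `box_to_mask`), the normalized $[0,1]$ coordinate convention, the text format, and the mask resolution are all assumptions made for illustration; the summary above does not specify how the paper encodes these.

```python
import math
from typing import List, Tuple

# One box p = (x1, y1, x2, y2); coordinates assumed normalized to [0, 1].
Box = Tuple[float, float, float, float]


def trajectory_to_text(boxes: List[Box], precision: int = 2) -> str:
    """Serialize a box sequence [p_1, ..., p_n] as plain text.

    The output format is hypothetical, not the paper's prompt format.
    """
    parts = []
    for x1, y1, x2, y2 in boxes:
        if not (x1 <= x2 and y1 <= y2):
            raise ValueError("expected x1 <= x2 and y1 <= y2")
        coords = ", ".join(f"{v:.{precision}f}" for v in (x1, y1, x2, y2))
        parts.append(f"({coords})")
    return "[" + ", ".join(parts) + "]"


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Rasterize one normalized box into a height x width binary mask.

    Cells touched by the box are 1, all others 0 (an assumed stand-in
    for a per-frame trajectory mask).
    """
    x1, y1, x2, y2 = box
    c0, c1 = math.floor(x1 * width), math.ceil(x2 * width)
    r0, r1 = math.floor(y1 * height), math.ceil(y2 * height)
    return [
        [1 if (r0 <= r < r1 and c0 <= c < c1) else 0 for c in range(width)]
        for r in range(height)
    ]


if __name__ == "__main__":
    # A toy falling object: the box moves downward with growing steps.
    falling = [(0.4, 0.1, 0.6, 0.3), (0.4, 0.2, 0.6, 0.4), (0.4, 0.5, 0.6, 0.7)]
    print(trajectory_to_text(falling))
    for row in box_to_mask(falling[0], height=5, width=5):
        print(row)
```

Plain text keeps the trajectory in a form that a language model can both emit and consume, while the mask gives a spatial version of the same box that an attention constraint could be applied to; both choices here are illustrative rather than taken from the paper.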

Abstract

Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
