TrajLoom: Dense Future Trajectory Generation from Video

Zewei Zhang, Jia Jun Cheng Xian, Kaiwen Liu, Ming Liang, Hang Chu, Jun Chen, Renjie Liao

Abstract

Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at https://trajloom.github.io/.
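As a rough illustration of the first component, the sketch below shows one way Grid-Anchor Offset Encoding could be written: each point's absolute position is replaced by its displacement from a fixed pixel-center anchor. The function names, the `(T, H, W, 2)` array layout, the one-query-point-per-pixel assumption, and the `+0.5` pixel-center convention are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def pixel_center_anchors(height, width):
    """Return an (H, W, 2) array of (x, y) pixel-center anchors.

    Assumption: the anchor of grid cell (row y, column x) is (x + 0.5, y + 0.5).
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([xs, ys], axis=-1).astype(np.float32) + 0.5


def encode_offsets(tracks, anchors):
    """Convert absolute tracks (T, H, W, 2) into offsets from their anchors."""
    return tracks - anchors[None]


def decode_offsets(offsets, anchors):
    """Invert encode_offsets: recover absolute (x, y) positions."""
    return offsets + anchors[None]
```

Under these assumptions a static point has an offset of zero at every frame regardless of where it sits in the image, which is one way to read the claim that the encoding reduces location-dependent bias.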

Paper Structure (55 sections, 21 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: For each sequence, the model observes an 81-frame history (left) and predicts future trajectories for the next 81 frames (right). Predicted trajectories are shown at three times: early, middle, and final. Colors show the spatial order of query points.
  • Figure 2: Overview of our pipeline. Given observed trajectories $\mathcal{T}^{p}$, we rasterize and encode them with Grid-Anchor Offset Encoding into a dense offset field, then compress with TrajLoom-VAE into history latents $\mathbf{z}^{p}$. Conditioned on $\mathbf{z}^{p}$ and video features, TrajLoom-Flow generates future latents via rectified-flow integration with boundary hints, which are decoded by TrajLoom-VAE into future trajectories $\hat{\mathcal{T}}^{f}$.
  • Figure 3: Grid-Anchor Offset Encoding converts absolute trajectories into offset space, reducing the bias of absolute coordinates.
  • Figure 4: We initialize $\mathbf{z}_0$ from scaled Gaussian noise, then add each last history token $\mathbf{z}(-1,n)$ to the first future token $\mathbf{z}_0(0,n)$.
  • Figure 5: Comparison with WHN (L). Each row shows a dataset: Kinetics, RoboTAP, Kubric, and MagicData (E). Our method yields smoother and more coherent motion.
  • ...and 5 more figures
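
The captions for Figures 2 and 4 describe initializing the future latents from scaled Gaussian noise with a boundary hint, then integrating a rectified flow. The following is a minimal sketch of that idea only. The `(time, tokens, channels)` latent layout, the `noise_scale` and `num_steps` parameters, the fixed-step Euler solver, and the `velocity_fn` callable (a hypothetical stand-in for the learned TrajLoom-Flow network) are all assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np


def init_with_boundary_hint(z_hist, num_future, noise_scale=1.0, rng=None):
    """Draw scaled Gaussian noise and add the last history token to the
    first future token, per token index n (as in the Figure 4 caption).

    z_hist: history latents with assumed shape (T_p, N, C).
    Returns z0 with shape (num_future, N, C).
    """
    rng = np.random.default_rng() if rng is None else rng
    _, n_tokens, dim = z_hist.shape
    z0 = noise_scale * rng.standard_normal((num_future, n_tokens, dim))
    z0[0] += z_hist[-1]  # z_0(0, n) <- z_0(0, n) + z(-1, n)
    return z0


def euler_integrate(z0, velocity_fn, num_steps=10):
    """Integrate dz/dt = velocity_fn(z, t) from t = 0 to t = 1 with
    fixed-step Euler updates (an assumed solver choice)."""
    z = z0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity_fn(z, k * dt)
    return z
```

In a full pipeline the integrated latents would then be passed to the TrajLoom-VAE decoder to obtain future trajectories; that step is omitted here because the decoder interface is not described in this section.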