Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation

Su Sun; Cheng Zhao; Himangi Mittal; Gaurav Mittal; Rohith Kukkala; Yingjie Victor Chen; Mei Chen

TL;DR

Track4DGen is a two-stage framework that uses a foundation point tracker to inject motion priors into both a multi-view video diffusion model and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. Stage One enforces dense, feature-level point correspondences inside the diffusion generator to suppress appearance drift and improve cross-view coherence; Stage Two concatenates the resulting diffusion features with Hex-plane features and adds 4D Spherical Harmonics for higher-fidelity dynamic rendering. The approach yields temporally stable, text-editable 4D assets and reports gains over baselines on multi-view video metrics and CLIP-based 4D evaluations. The authors also introduce Sketchfab28, a benchmark for object-centric 4D generation, and provide implementation details, ablations, and public release plans to support reproducibility.

Abstract

Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that discrepancies across views arise because supervision is limited to pixel- or latent-space video-diffusion losses, which lack explicit, temporally aware, feature-level tracking guidance. We present Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to inject tracker-derived motion priors directly into the intermediate feature representations used for both multi-view video generation and 4D-GS reconstruction. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that curb appearance drift and enhance cross-view coherence. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding that concatenates co-located diffusion features (carrying the Stage-One tracking priors) with Hex-plane features, and we augment this encoding with 4D Spherical Harmonics for higher-fidelity dynamics modeling. Track4DGen surpasses baselines on both multi-view video generation and 4D generation benchmarks, yielding temporally stable, text-editable 4D assets. Lastly, we curate Sketchfab28, a high-quality dataset for benchmarking object-centric 4D generation and fostering future research.
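To make the two stages more concrete, the following is a minimal, dependency-free sketch of what a feature-level correspondence penalty and a concatenated hybrid motion feature could look like. The abstract does not give the exact loss, so everything here is an assumption for illustration: the cosine-distance form, the nearest-pixel sampling, the use of frame 0 as the reference, and the names `correspondence_loss` and `hybrid_motion_feature` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def sample(feat_map, point):
    """Nearest-pixel lookup; feat_map[y][x] is a feature vector, point is (x, y)."""
    h, w = len(feat_map), len(feat_map[0])
    x = min(max(int(round(point[0])), 0), w - 1)
    y = min(max(int(round(point[1])), 0), h - 1)
    return feat_map[y][x]


def correspondence_loss(feats, tracks, visible):
    """Assumed form of a tracking-guided feature loss (not the paper's definition).

    feats[t][y][x] -> feature vector at frame t
    tracks[t][n]   -> (x, y) location of tracked point n at frame t
    visible[t][n]  -> True if point n is visible at frame t
    Returns the mean (1 - cosine similarity) between each point's feature
    in frame 0 and its feature at the tracked location in every later frame.
    """
    total, count = 0.0, 0
    for t in range(1, len(feats)):
        for n, point in enumerate(tracks[t]):
            if visible[0][n] and visible[t][n]:
                ref = sample(feats[0], tracks[0][n])
                cur = sample(feats[t], point)
                total += 1.0 - cosine(ref, cur)
                count += 1
    return total / max(count, 1)


def hybrid_motion_feature(diffusion_feat, hexplane_feat):
    """Concatenate a co-located diffusion feature with a Hex-plane feature."""
    return list(diffusion_feat) + list(hexplane_feat)
```

Under these assumptions, the penalty is close to zero when features stay constant along each track and grows when the feature at a tracked location changes over time, which is the behavior the abstract attributes to Stage One. The concatenation mirrors the Stage-Two description only at the level of vector layout; how the combined feature is decoded into Gaussian motion is not specified in this section.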
