WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling
Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, Qixing Huang
TL;DR
WorldReel introduces a unified 4D video generator that simultaneously outputs RGB frames and explicit 4D scene representations (depth/point cloud, calibrated cameras, and scene flow) to maintain a persistent dynamic 3D world. It leverages a geo–motion latent that fuses depth and optical flow into a diffusion-based transformer, paired with a temporal DPT decoder to produce coherent geometry and motion across time. Training on a mix of synthetic data with precise 4D supervision and real videos with pseudo-labels enables strong generalization while preserving geometric fidelity. Experiments show state-of-the-art 4D consistency and improved geometry/motion metrics, marking a step toward editable, agent-ready 4D world models.
Abstract
Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data provides precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state of the art for consistent video generation with dynamic scenes and moving cameras, improving geometric consistency and motion coherence and reducing view-time artifacts relative to competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact with, and reason about scenes through a single, stable spatiotemporal representation.
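To make the explicit 4D representation concrete, the following minimal sketch shows how a per-frame depth map and a calibrated camera can be lifted to a world-space pointmap, and how scene flow can be read off as the displacement of a point between two frames. It is an illustration under stated assumptions, not WorldReel's implementation: it assumes a pinhole camera with intrinsics `fx, fy, cx, cy` and a camera-to-world pose `(R, t)`, and the function names `unproject`, `to_world`, `pointmap`, and `scene_flow` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with depth to a camera-space 3D point (pinhole model, assumed)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(p_cam: Vec3, R: Sequence[Sequence[float]], t: Sequence[float]) -> Vec3:
    """Apply an assumed camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def pointmap(depth_map: Sequence[Sequence[float]],
             fx: float, fy: float, cx: float, cy: float,
             R: Sequence[Sequence[float]], t: Sequence[float]) -> List[List[Vec3]]:
    """Build a world-space pointmap: one 3D point per pixel of the depth map."""
    return [
        [to_world(unproject(u, v, d, fx, fy, cx, cy), R, t)
         for u, d in enumerate(row)]
        for v, row in enumerate(depth_map)
    ]


def scene_flow(p_t: Vec3, p_t1: Vec3) -> Vec3:
    """3D displacement of the same surface point between two frames (world space)."""
    return tuple(b - a for a, b in zip(p_t, p_t1))


# Usage: identity pose and unit focal length, chosen purely for illustration.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pm = pointmap([[1.0, 1.0], [1.0, 2.0]], 1.0, 1.0, 0.0, 0.0, I3, (0.0, 0.0, 0.0))
flow = scene_flow((0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
```

Because every frame's points are expressed in one shared world frame, a static surface point maps to the same coordinates regardless of camera motion, and only genuinely moving content yields non-zero scene flow. This is the sense in which a single underlying scene persists across viewpoints.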
