WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling

Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, Qixing Huang

TL;DR

WorldReel introduces a unified 4D video generator that simultaneously outputs RGB frames and explicit 4D scene representations (depth/point cloud, calibrated cameras, and scene flow) to maintain a persistent dynamic 3D world. It leverages a geo–motion latent that fuses depth and optical flow into a diffusion-based transformer, paired with a temporal DPT decoder to produce coherent geometry and motion across time. Training on a mix of synthetic data with precise 4D supervision and real videos with pseudo-labels enables strong generalization while preserving geometric fidelity. Experiments show state-of-the-art 4D consistency and improved geometry/motion metrics, marking a step toward editable, agent-ready 4D world models.

Abstract

Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data provides precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state of the art for consistent video generation with dynamic scenes and moving cameras, improving geometric consistency and motion coherence metrics and reducing view-time artifacts relative to competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact with, and reason about scenes through a single, stable spatiotemporal representation.
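To make the relationship between depth, camera, and pointmap concrete, the following is a minimal sketch of how a per-frame depth map and a camera pose could be lifted into a world-space pointmap. It is not WorldReel's implementation: the function name `depth_to_pointmap`, the pinhole intrinsics `K`, the z-depth convention, and the camera-to-world pose convention are all assumptions made for illustration.

```python
import numpy as np

def depth_to_pointmap(depth, K, cam_to_world):
    """Lift an (H, W) depth map to an (H, W, 3) world-space pointmap.

    Assumptions (illustrative, not taken from the paper):
      depth        : z-depth along the camera's optical axis
      K            : 3x3 pinhole intrinsics
      cam_to_world : 4x4 rigid camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))       # pixel grid, u = column, v = row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                      # back-project to rays with z = 1
    pts_cam = rays * depth[..., None]                    # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t                             # rigid transform into the world frame

# Usage: a flat wall 2 units in front of a camera placed at z = 5 in the world.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.0, 0.0, 5.0]
pointmap = depth_to_pointmap(np.full((3, 3), 2.0), K, pose)
```

Under these conventions, a static surface seen from different frames maps to the same world-space points, which is the sense in which a single underlying scene can persist across viewpoints.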

Paper Structure

This paper contains 14 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: End-to-end 4D generation. Given a text prompt and a single input image (left), WorldReel generates a video (center) together with explicit 4D scene representations: per-frame geometry (depth + point cloud) with calibrated camera poses, and per-frame motion (optical flow, scene flow) with object masks (bottom panels). The rendered 4D scenes (right) exhibit consistent structure over time, even under non-rigid dynamics, illustrating spatio-temporal consistency and tight coupling of appearance, geometry, and motion. Project page: https://bshfang.github.io/worldreel/
  • Figure 2: Overview of WorldReel. We augment a video diffusion transformer with a geo–motion latent (from RGB and 2.5D cues such as depth/optical flow) to inject a 4D inductive bias for spatio-temporal consistency. A temporal DPT decoder is trained with direct supervision and regularization to predict unified 4D outputs (depth/point cloud, calibrated camera, 3D scene flow, and masks).
  • Figure 3: Qualitative image-to-video comparison on in-the-wild scenes. Given a single input image (left), we show sampled frames from videos generated by 4DNeX [chen20254dnex], DimensionX [sun2024dimensionx], GeoVideo [bai2025geovideo], and WorldReel (ours). Prior methods often exhibit geometry drift and motion inconsistencies (e.g., warped facades, misaligned vehicles), while our results better preserve scene layout and maintain coherent camera and non-rigid dynamics. See the supplementary for prompts, full videos for all methods, and additional comparisons.
  • Figure 4: Qualitative 4D generation and geometry. For two in-the-wild inputs (left, red boxes), we show selected frames from our generated videos (top rows) alongside the corresponding dynamic point clouds rendered from our pointmaps and camera trajectories (bottom rows). The persistent structure and consistent camera/object motion illustrate a single, stable 3D scene across time, evidencing strong geometric consistency in the underlying world state. See supplementary for additional examples.
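
The captions above list optical flow, scene flow, and pointmaps as separate outputs. As a rough illustration of how these quantities relate geometrically, the sketch below derives 3D scene flow by following 2D optical flow between two world-space pointmaps. This is not the paper's method (per the Figure 2 caption, WorldReel predicts these outputs with a temporal DPT decoder); the function name `scene_flow_from_pointmaps`, the `(du, dv)` flow layout, and the nearest-neighbor lookup are assumptions chosen for brevity.

```python
import numpy as np

def scene_flow_from_pointmaps(pm_t, pm_t1, flow):
    """Estimate per-pixel 3D scene flow from frame t to t+1.

    Assumptions (illustrative only):
      pm_t, pm_t1 : (H, W, 3) world-space pointmaps for frames t and t+1
      flow        : (H, W, 2) optical flow (du, dv) in pixels from t to t+1
    Returns the (H, W, 3) scene flow and an (H, W) boolean validity mask.
    """
    H, W, _ = pm_t.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Follow the optical flow to the matching pixel in frame t+1 (nearest neighbor).
    u1 = np.rint(u + flow[..., 0]).astype(int)
    v1 = np.rint(v + flow[..., 1]).astype(int)
    valid = (u1 >= 0) & (u1 < W) & (v1 >= 0) & (v1 < H)
    u1c, v1c = np.clip(u1, 0, W - 1), np.clip(v1, 0, H - 1)
    # Scene flow is the 3D displacement between corresponding world points.
    sf = pm_t1[v1c, u1c] - pm_t
    sf[~valid] = 0.0                                     # no correspondence outside the image
    return sf, valid

# Usage: every point moves by d in 3D while shifting one pixel to the right.
rng = np.random.default_rng(0)
pm_t = rng.normal(size=(2, 3, 3))
d = np.array([0.1, 0.0, 0.0])
pm_t1 = np.zeros_like(pm_t)
pm_t1[:, 1:] = pm_t[:, :-1] + d
flow = np.zeros((2, 3, 2))
flow[..., 0] = 1.0
sf, valid = scene_flow_from_pointmaps(pm_t, pm_t1, flow)
```

Because the pointmaps live in a shared world frame, camera motion cancels out and the result is zero for static content, leaving only true object motion. The sketch ignores occlusion and sub-pixel flow.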