FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction

Yixiang Dai, Fan Jiang, Chiyu Wang, Mu Xu, Yonggang Qi

TL;DR

FantasyWorld addresses the lack of explicit 3D grounding in video diffusion models by introducing a geometry-aware, unified backbone that jointly predicts video latents and an implicit 3D field. It achieves this with a two-stage training scheme: Stage 1 latent bridging aligns geometry with Wan2.1 features, and Stage 2 unified co-optimization uses bidirectional cross-attention between a video-imagination branch and a geometry branch, optimizing a combined loss $L_{\text{total}} = E_{\theta}[\|\epsilon - \epsilon_\theta(z_t,t,c)\|^2] + \lambda L_{\text{geo}}$, where $L_{\text{geo}} = L_{\text{depth}} + L_{\text{pmap}} + 3\,L_{\text{camera}}$. The approach yields improved multi-view coherence and geometric fidelity on WorldScore benchmarks and 3DGS reconstructions, demonstrating stronger 3D consistency and reusable geometry without per-scene optimization. This geometry-augmented framework offers a practical path toward reusable 3D-aware world models for embodied AI and downstream tasks like novel view synthesis and navigation.
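As a rough illustration of how the combined objective in the TL;DR fits together, the sketch below computes the denoising term and the weighted geometry term for a single sample in plain Python. The function names (`squared_l2`, `geometry_loss`, `total_loss`), the list-based inputs, and the default value of `lam` are assumptions made for this example and do not come from the paper; only the structure of the sum and the weight of 3 on the camera term follow the formula above. The expectation in the formula would correspond to averaging this per-sample value over a training batch.

```python
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance ||a - b||^2 between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def geometry_loss(l_depth: float, l_pmap: float, l_camera: float) -> float:
    """L_geo = L_depth + L_pmap + 3 * L_camera (camera weight of 3 as stated in the summary)."""
    return l_depth + l_pmap + 3.0 * l_camera


def total_loss(
    noise: Sequence[float],
    noise_pred: Sequence[float],
    l_depth: float,
    l_pmap: float,
    l_camera: float,
    lam: float = 1.0,  # hypothetical default; the summary does not state lambda
) -> float:
    """Per-sample L_total = ||eps - eps_theta||^2 + lambda * L_geo."""
    diffusion_term = squared_l2(noise, noise_pred)
    return diffusion_term + lam * geometry_loss(l_depth, l_pmap, l_camera)


if __name__ == "__main__":
    # Toy numbers chosen only to make the arithmetic easy to follow.
    value = total_loss(
        noise=[1.0, 0.0],
        noise_pred=[0.0, 0.0],
        l_depth=0.5,
        l_pmap=0.25,
        l_camera=0.25,
        lam=0.5,
    )
    print(value)  # 1.0 + 0.5 * (0.5 + 0.25 + 3 * 0.25) = 1.75
```

In a real training loop the three geometry terms would themselves be computed from predicted and reference depth maps, point maps, and camera parameters; here they are passed in as precomputed scalars to keep the sketch focused on how the terms are combined.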

Abstract

High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite the established strong imaginative priors, current video foundation models lack explicit 3D grounding capabilities, thus being limited in both spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.

Paper Structure

This paper contains 21 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: FantasyWorld overview. Given multimodal inputs (image, text, and camera trajectory), the model generates photorealistic videos along the specified views while constructing an implicit 3D representation for consistent geometry.
  • Figure 2: Overview of FantasyWorld. Inputs (image, text, camera) are processed by PCBs and stacked IRG blocks, where an asymmetric dual-branch design couples video synthesis with 3D reasoning. The model outputs geometry-consistent video frames and task-agnostic 3D features.
  • Figure 3: PCA over timestep–block pairs: rows vary timesteps top $\rightarrow$ bottom, columns vary blocks left $\rightarrow$ right; the red rectangle marks the IRG input latents.
  • Figure 4: Qualitative comparison of world generation. WonderWorld shows missing regions, Voyager suffers from temporal incoherence and degraded first-frame fidelity, AETHER produces low-detail outputs, and Uni3C exhibits abrupt stylistic shifts. In contrast, FantasyWorld maintains stronger 3D consistency and coherent style across views.
  • Figure 5: Qualitative Comparison of Geometry Fidelity.
  • ...and 1 more figure