World-consistent Video Diffusion with Explicit 3D Modeling

Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu

TL;DR

World-consistent Video Diffusion (WVD) introduces explicit 3D supervision for diffusion-based video and multi-view generation by jointly modeling RGB frames and geometry via XYZ images. A Diffusion Transformer learns the joint distribution of RGB and XYZ, enabling flexible inpainting-based inference and a post-optimization step to recover camera parameters and depth, effectively unifying single-image-to-3D, multi-view stereo, and camera-controlled video generation. The approach demonstrates competitive performance across RealEstate10K, ScanNet, MVImgNet, CO3D, and Habitat, and shows strong 3D consistency alongside high-fidelity appearance. These results suggest that WVD could serve as a scalable 3D foundation model that supports a range of downstream tasks with a single pretrained model.
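To make the inpainting-based inference concrete, the following is a minimal sketch of how the three tasks could be expressed as different "observed" masks over a 6-channel (RGB + XYZ) video. It is not the authors' implementation: the channel order (RGB first, XYZ last), the task names, and the helpers `make_task_mask` and `merge_observed` are assumptions made for illustration only.

```python
import numpy as np

def make_task_mask(num_frames, height, width, task):
    """Boolean mask over a (T, H, W, 6) video; True marks observed entries.

    Assumed layout: channels 0..2 are RGB, channels 3..5 are XYZ.
    """
    mask = np.zeros((num_frames, height, width, 6), dtype=bool)
    if task == "image_to_3d":
        mask[0, ..., :3] = True   # only the first RGB frame is given
    elif task == "multi_view_stereo":
        mask[..., :3] = True      # all RGB frames given, XYZ is unknown
    elif task == "camera_control":
        mask[0, ..., :3] = True   # first RGB frame is given
        mask[..., 3:] = True      # XYZ given for every frame (from projection)
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

def merge_observed(sample, observed, mask):
    """Keep observed entries fixed and take everything else from the sample."""
    return np.where(mask, observed, sample)
```

In a replacement-style inpainting sampler, a merge like `merge_observed` would typically be applied at each denoising step so that the known channels stay fixed while the remaining ones are generated.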

Abstract

Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
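As an illustration of what an XYZ image contains, the sketch below builds one from a depth map under a standard pinhole camera model. The function name `depth_to_xyz_image`, the integer pixel-coordinate convention, and the use of a camera-to-world matrix are assumptions; the paper may normalize or anchor the global frame differently.

```python
import numpy as np

def depth_to_xyz_image(depth, K, cam_to_world):
    """Unproject a depth map (H, W) into an XYZ image (H, W, 3).

    K is the 3x3 pinhole intrinsics matrix and cam_to_world is a 4x4 pose.
    Each output pixel stores the 3D point it observes in the global frame.
    """
    h, w = depth.shape
    # u indexes columns, v indexes rows (assumed pixel convention).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project homogeneous pixels to camera-frame rays with z = 1.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth[..., None]
    # Move the points from the camera frame into the global frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t
```

Stacking such an XYZ image with its RGB frame along the channel axis gives one frame of the 6-channel video that the diffusion model is trained on.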

Paper Structure

This paper contains 32 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: WVD predicts 6D videos from an image, unifying various 3D tasks with a single diffusion model.
  • Figure 2: An illustration of the WVD pipeline. The left part shows 6D videos formed by RGB and XYZ frames. In the right part, WVD iteratively denoises the 6D videos conditioned on a specified RGB frame, which is highlighted with a red box.
  • Figure 3: Illustration of the camera-controlled multi-view generation pipeline. We first use WVD to infer the geometry from the input image, and then project it to obtain XYZ images for novel views. Next, we employ an inpainting strategy to sample RGB images.
  • Figure 4: Synthesized Multi-view RGB and XYZ Images by WVD, and associated reconstructed point clouds. Input images are randomly sampled across the validation set.
  • Figure 5: Monocular depth estimation on the NYU-v2 (Silberman et al., 2012) and BONN (Palazzolo et al., 2019) benchmarks. We present RGB input images, ground-truth depth maps, and the predicted depth maps from DUSt3R (512 resolution) and WVD, respectively.
  • ...and 5 more figures
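
The projection step described in the Figure 3 caption can be sketched as a simple point-splatting routine with a z-buffer. This is only an illustration under a pinhole camera assumption; the function `project_xyz_to_view`, its arguments, and the nearest-pixel splatting are hypothetical choices, not the renderer used in the paper.

```python
import numpy as np

def project_xyz_to_view(points_world, K, world_to_cam, height, width):
    """Render an XYZ image (H, W, 3) of a point cloud from a novel camera.

    Each pixel stores the global coordinates of the nearest visible point.
    Pixels that no point projects onto are left as NaN.
    """
    R, t = world_to_cam[:3, :3], world_to_cam[:3, 3]
    points_cam = points_world @ R.T + t
    xyz_image = np.full((height, width, 3), np.nan)
    z_buffer = np.full((height, width), np.inf)
    for p_world, p_cam in zip(points_world, points_cam):
        z = p_cam[2]
        if z <= 0:
            continue  # point is behind the camera
        u, v, _ = K @ p_cam / z
        col, row = int(round(u)), int(round(v))
        inside = 0 <= row < height and 0 <= col < width
        if inside and z < z_buffer[row, col]:
            z_buffer[row, col] = z           # keep the closest point
            xyz_image[row, col] = p_world    # store global coordinates
    return xyz_image
```

The NaN pixels mark regions that are unseen from the input view; in an inpainting-style sampler these would be treated as unknown rather than as observed geometry.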