Wonderland: Navigating 3D Scenes from a Single Image

Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, Jian Ren

TL;DR

Wonderland tackles single-image 3D scene reconstruction by combining latents from a camera-guided video diffusion model with a feed-forward Latent Large Reconstruction Model (LaLRM) to generate wide-scope 3D Gaussian Splatting representations. It introduces dual-branch camera conditioning (one ControlNet-like branch and one LoRA-based branch) to achieve precise camera pose control within a video diffusion foundation model, and it operates in the compressed latent space for efficiency. A progressive training strategy that includes in-the-wild data enables robust generalization to out-of-domain scenes. Experiments on the RealEstate10K, DL3DV, Tanks-and-Temples, and Mip-NeRF benchmarks show state-of-the-art performance in 3D consistency, view synthesis quality, and speed, illustrating the effectiveness of latent-space 3D reconstruction guided by video priors.

Abstract

How can one efficiently generate high-quality, wide-scope 3D scenes from arbitrary single images? Existing methods suffer from several drawbacks, such as requiring multi-view data, time-consuming per-scene optimization, distorted geometry in occluded areas, and low visual quality in backgrounds. Our novel 3D scene reconstruction pipeline overcomes these limitations. Specifically, we introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splatting representations of scenes in a feed-forward manner. The video diffusion model is designed to create videos that precisely follow specified camera trajectories, allowing it to generate compressed video latents that encode multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets affirm that our model significantly outperforms existing single-view 3D scene generation methods, especially with out-of-domain images. Thus, we demonstrate for the first time that a 3D reconstruction model can effectively be built upon the latent space of a diffusion model in order to realize efficient 3D scene generation.

Paper Structure

This paper contains 23 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of Wonderland. Given a single image, a camera-guided video diffusion model follows the camera trajectory and generates a 3D-aware video latent. This is leveraged by the Latent Large Reconstruction Model (LaLRM) to construct the 3D scene in a feed-forward manner. The video diffusion model incorporates dual-branch camera conditioning to achieve precise pose control. The LaLRM operates in the video latent space and efficiently reconstructs a wide-scope, high-fidelity 3D scene.
  • Figure 2: Qualitative comparison against prior art in camera-guided video generation. Frame 14 of each sample is shown for comparison, with the first column displaying the conditional image and camera trajectory (bottom-right). Blue bounding boxes denote reference areas to assist comparison, and orange bounding boxes highlight low-quality generations. The rightmost column shows the last frames generated by our method. Our method outperforms prior methods in both precise camera control and high-quality, wide-scope video generation.
  • Figure 3: Qualitative comparison of 3D scene generation. Blue bounding boxes show visible regions from conditional images and yellow bounding boxes show low-quality regions. Our approach generates much higher quality novel views from one conditional image. Note that ZeroNVS generations have a square resolution.
  • Figure 4: Comparison with ViewCrafter (left) and WonderJourney (right) for in-the-wild 3D scene generation from single input images.
  • Figure 5: Comparison with ZeroNVS and Cat3D on the Mip-NeRF dataset in 3D scene generation from single input images. For each scene, the conditional image is shown in the left column along with renderings from two viewpoints: one at the conditional-image (starting) view (upper) and another at around a 120° rotation from the starting view (lower).
  • ...and 5 more figures