Wonderland: Navigating 3D Scenes from a Single Image
Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, Jian Ren
TL;DR
Wonderland tackles single-image 3D scene reconstruction by combining camera-guided video diffusion latents with a feed-forward latent reconstruction model, LaLRM, to generate wide-scope 3D Gaussian Splatting representations. It introduces dual-branch conditioning (ControlNet-like and LoRA-based) to achieve precise camera pose control within a video diffusion foundation model, while operating in the compressed latent space for efficiency. A progressive training strategy that includes in-the-wild data enables robust generalization to out-of-domain scenes. Extensive experiments across the RealEstate10K, DL3DV, Tanks-and-Temples, and Mip-NeRF benchmarks show state-of-the-art performance in 3D consistency, view synthesis quality, and speed, illustrating the effectiveness of latent-space 3D reconstruction guided by video priors.
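As a generic illustration of the low-rank adaptation (LoRA) idea named above, the following minimal sketch shows a frozen linear weight augmented with a trainable low-rank update. It is not the paper's implementation: the function name `lora_forward`, the dimensions, and the scaling factor `alpha` are all assumptions chosen for illustration, and how Wonderland wires LoRA into its video diffusion model is not specified in this summary.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Frozen linear layer W plus a low-rank update scaled by alpha / r.

    x: (batch, d_in), W: (d_out, d_in), A: (r, d_in), B: (d_out, r).
    All names and shapes here are illustrative assumptions.
    """
    r = A.shape[0]
    base = x @ W.T                 # output of the frozen pretrained weight
    update = (x @ A.T) @ B.T       # low-rank path: project down to r, then up
    return base + (alpha / r) * update

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 4
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized
x = rng.normal(size=(2, d_in))

# With B initialized to zero, the adapted layer reproduces the frozen layer.
y_init = lora_forward(x, W, A, B, alpha=8.0)

# After B becomes non-zero, the result matches a single merged weight matrix.
B_trained = rng.normal(size=(d_out, r))
y_adapted = lora_forward(x, W, A, B_trained, alpha=8.0)
W_merged = W + (8.0 / r) * (B_trained @ A)
```

Zero-initializing `B` is a common choice because the adapted model starts out identical to the pretrained one, so fine-tuning for a new condition such as camera pose begins from the original behavior.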
Abstract
How can one efficiently generate high-quality, wide-scope 3D scenes from arbitrary single images? Existing methods suffer from several drawbacks, such as requiring multi-view data, time-consuming per-scene optimization, distorted geometry in occluded areas, and low visual quality in backgrounds. Our novel 3D scene reconstruction pipeline overcomes these limitations. Specifically, we introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splatting representations of scenes in a feed-forward manner. The video diffusion model is designed to create videos that precisely follow specified camera trajectories, allowing it to generate compressed video latents that encode multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets affirm that our model significantly outperforms existing single-view 3D scene generation methods, especially with out-of-domain images. Thus, we demonstrate for the first time that a 3D reconstruction model can effectively be built upon the latent space of a diffusion model in order to realize efficient 3D scene generation.
