A Recipe for Generating 3D Worlds From a Single Image

Katja Schwarz, Denys Rozumnyi, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder

TL;DR

This work generates immersive 3D worlds from a single image by decomposing the task into two tractable subproblems, panorama synthesis and depth-based lifting with occlusion filling, which are followed by 3D reconstruction. It introduces a training-free panorama generation pipeline that uses in-context learning with a ControlNet-enabled inpainting model, and it lifts the panorama to metric 3D using monocular depth estimators, supplemented by a fine-tuned, point-cloud-conditioned inpainting step. The core contributions are (i) a panorama-first approach with prompt-driven coherence, (ii) a robust point-cloud-conditioned inpainting strategy that includes forward-backward warping, and (iii) a 3D Gaussian Splatting (3DGS) reconstruction with a learnable distortion module. Together these outperform state-of-the-art video-based methods on multiple image quality metrics. The result is a VR-ready, navigable 3D environment from a single image with minimal training, though limitations remain in scene scale, backside completion, and end-to-end speed.

Abstract

We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and uses existing generative models. Our process involves two steps: generating coherent panoramas using a pre-trained diffusion model and lifting these into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, requiring minimal fine-tuning. Tested on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art, video synthesis-based methods along multiple quantitative image quality metrics. Project Page: https://katjaschwarz.github.io/worlds/
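The lifting step described in the abstract can be illustrated geometrically. The following is a minimal sketch, not the authors' implementation: it assumes the panorama is stored in equirectangular format, that the depth map holds the distance along each viewing ray, and that the camera sits at the origin with the y axis pointing up. The function name lift_equirect_to_points and these conventions are hypothetical, and the depth estimator itself is out of scope here.

```python
import numpy as np


def lift_equirect_to_points(depth: np.ndarray) -> np.ndarray:
    """Unproject an equirectangular depth map of shape (H, W) to 3D points (H, W, 3).

    Assumed conventions (not taken from the paper): depth is the distance
    along the viewing ray, the camera is at the origin, y points up, and
    the panorama center looks along +z.
    """
    h, w = depth.shape
    # Pixel centers mapped to longitude in (-pi, pi) and latitude in (-pi/2, pi/2).
    lon = ((np.arange(w) + 0.5) / w - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(h) + 0.5) / h) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both of shape (H, W)
    # Unit ray direction for every pixel on the sphere.
    dirs = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )
    # Scale each unit ray by its depth to obtain a point cloud.
    return dirs * depth[..., None]


# Usage: a constant depth of 2 m places every point on a sphere of radius 2.
points = lift_equirect_to_points(np.full((4, 8), 2.0))
```

With a real depth map, the resulting points could be rendered from shifted viewpoints to reveal the unobserved regions that the point-cloud-conditioned inpainting step is meant to fill.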

Paper Structure

This paper contains 15 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Overview: Given a single input image, our pipeline generates a 360-degree world. The scene is parameterized by Gaussian Splats and can be explored on a VR headset within a cube with 2m side length. Project Page: https://katjaschwarz.github.io/worlds/
  • Figure 2: 3D Worlds: Images rendered from the 3DGS representation generated by our pipeline, given only the single image shown on the left. The orientation of the VR headset in the bottom right corner highlights the direction of the novel views.
  • Figure 3: Panorama Synthesis: Generated panorama images (top) and the respective synthesis heuristic (bottom).
  • Figure 4: Panorama Lifting: Comparison of the lifted point clouds using metric depth estimation (Metric3Dv2) and monocular depth estimation (MoGE). The metric point cloud is distorted and contains prominent artifacts around the center.
  • Figure 5: Panorama Synthesis: 360-degree panoramas generated by our method from a single input image. The reconstructions are consistent and result in accurate 3DGS scenes, as visible in Fig. 2.
  • ...and 5 more figures