WonderWorld: Interactive 3D Scene Generation from a Single Image

Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T. Freeman, Jiajun Wu

TL;DR

WonderWorld delivers interactive 3D scene generation from a single image by introducing FLAGS, a Fast Layered Gaussian Surfels representation, coupled with a geometry-aware initialization to enable sub-second per-layer optimization and sub-10-second per-scene generation on a single GPU. It further mitigates geometric seams across extrapolated scenes through a training-free guided depth diffusion that conditions depth estimates on visible geometry. The system supports real-time user control over camera paths and content prompts to create connected, diverse worlds, demonstrated against strong baselines with quantitative and human-evaluated metrics. Ablation studies confirm the necessity of layered surfels, geometry-based initialization, and depth guidance for quality and consistency, and the work releases full code to promote reproducibility and adoption in VR, gaming, and creative design.

Abstract

We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes with low latency. The major challenge lies in achieving fast generation of 3D scenes. Existing scene generation approaches fall short on speed because they often require (1) progressively generating many views and depth maps, and (2) time-consuming optimization of the scene geometry representations. We introduce Fast Layered Gaussian Surfels (FLAGS) as our scene representation, along with an algorithm to generate it from a single view. Our approach does not need multiple views, and it leverages a geometry-based initialization that significantly reduces optimization time. Another challenge is generating coherent geometry that allows all scenes to be connected. We introduce guided depth diffusion, which allows partial conditioning of depth estimation. WonderWorld generates connected and diverse 3D scenes in less than 10 seconds on a single A6000 GPU, enabling real-time user interaction and exploration. We demonstrate the potential of WonderWorld for user-driven content creation and exploration in virtual environments. We release full code and software for reproducibility. Project website: https://kovenyu.com/WonderWorld/.

Paper Structure (45 sections, 9 equations, 15 figures, 6 tables, 2 algorithms)

Figures (15)

  • Figure 1: Starting with a single image, a user can interactively generate connected 3D scenes with diverse elements. The user can specify scene contents via text prompts and specify the layout by moving cameras (e.g., panorama-like camera paths as in the top row, or casual-walk camera paths as in the bottom row). We recommend seeing the interactive generation process at https://kovenyu.com/WonderWorld/.
  • Figure 2: The proposed WonderWorld: Our system takes a single image as input and generates connected diverse 3D scenes. Users can specify where (by moving the real-time rendering camera) and what to generate (by typing text prompts) and see a generated scene in less than 10 seconds. We summarize the outer control loop in an algorithm in the supplementary material.
  • Figure 3: Scale initialization of FLAGS: The sampling interval at a surfel is given by $T_\text{N} =d/(f\cos\theta)$.
  • Figure 4: Illustration of guided depth diffusion. The colored patches indicate that depth is computed in latent space.
  • Figure 5: Baseline comparison. The inset is the input image. We use a fixed panoramic camera path for evaluation.
  • ...and 10 more figures
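
The sampling-interval formula in the Figure 3 caption, $T_\text{N} = d/(f\cos\theta)$, can be checked numerically with a short sketch. The caption does not define the symbols, so the reading used here is an assumption: $d$ is the surfel's depth along the camera ray, $f$ is the focal length in pixels, and $\theta$ is the angle between the surfel normal and the viewing direction. The function name `sampling_interval` is hypothetical and is not taken from the released code.

```python
import math


def sampling_interval(depth: float, focal_px: float, theta_rad: float) -> float:
    """Evaluate T_N = d / (f * cos(theta)) from the Figure 3 caption.

    Assumed meanings (not stated in the caption):
      depth     -- distance d from the camera to the surfel
      focal_px  -- focal length f, in pixels
      theta_rad -- angle theta between surfel normal and view direction
    """
    cos_theta = math.cos(theta_rad)
    if cos_theta <= 0.0:
        # A surfel seen edge-on or from behind has no finite interval.
        raise ValueError("theta must be strictly less than 90 degrees")
    return depth / (focal_px * cos_theta)


# A fronto-parallel surfel (theta = 0) at depth 2 with a 500 px focal length.
frontal = sampling_interval(2.0, 500.0, 0.0)

# The same surfel tilted by 60 degrees: cos(60 deg) = 0.5, so the interval doubles.
tilted = sampling_interval(2.0, 500.0, math.radians(60.0))
```

Under this reading, the interval grows with depth and with surface tilt, which is consistent with the caption's use of the formula for scale initialization: surfels that are farther away or more oblique cover more of the surface per pixel.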