WonderWorld: Interactive 3D Scene Generation from a Single Image

Hong-Xing Yu; Haoyi Duan; Charles Herrmann; William T. Freeman; Jiajun Wu

TL;DR

WonderWorld delivers interactive 3D scene generation from a single image by introducing FLAGS, a Fast Layered Gaussian Surfels representation, coupled with a geometry-aware initialization to enable sub-second per-layer optimization and sub-10-second per-scene generation on a single GPU. It further mitigates geometric seams across extrapolated scenes through a training-free guided depth diffusion that conditions depth estimates on visible geometry. The system supports real-time user control over camera paths and content prompts to create connected, diverse worlds, demonstrated against strong baselines with quantitative and human-evaluated metrics. Ablation studies confirm the necessity of layered surfels, geometry-based initialization, and depth guidance for quality and consistency, and the work releases full code to promote reproducibility and adoption in VR, gaming, and creative design.

Abstract

We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes with low latency. The major challenge lies in achieving fast generation of 3D scenes. Existing scene generation approaches fall short on speed because they often require (1) progressively generating many views and depth maps, and (2) time-consuming optimization of the scene geometry representations. We introduce Fast Layered Gaussian Surfels (FLAGS) as our scene representation, along with an algorithm to generate it from a single view. Our approach does not need multiple views, and it leverages a geometry-based initialization that significantly reduces optimization time. Another challenge is generating coherent geometry that allows all scenes to be connected. We introduce guided depth diffusion, which allows partial conditioning of depth estimation. WonderWorld generates connected and diverse 3D scenes in less than 10 seconds on a single A6000 GPU, enabling real-time user interaction and exploration. We demonstrate the potential of WonderWorld for user-driven content creation and exploration in virtual environments. We release full code and software for reproducibility. Project website: https://kovenyu.com/WonderWorld/.

Paper Structure (45 sections, 9 equations, 15 figures, 6 tables, 2 algorithms)

Introduction
Related Work
Novel view generation.
3D world generation.
3D scene generation.
Video generation.
Fast 3D scene representations.
Approach
Formulation.
Overview.
Challenges.
Fast LAyered Gaussian Surfels (FLAGS)
Definition.
Single-view layer generation.
Geometry-based initialization.
...and 30 more sections

Figures (15)

Figure 1: Starting with a single image, a user can interactively generate connected 3D scenes with diverse elements. The user can specify scene contents via text prompts and specify the layout by moving cameras (e.g., panorama-like camera paths as in the top row, or casual-walk camera paths as in the bottom row). We recommend seeing the interactive generation process at https://kovenyu.com/WonderWorld/.
Figure 2: The proposed WonderWorld: Our system takes a single image as input and generates connected diverse 3D scenes. Users can specify where (by moving the real-time rendering camera) and what to generate (by typing text prompts) and see a generated scene in less than 10 seconds. We summarize the outer control loop in an algorithm in the supplementary material.
Figure 3: Scale initialization of FLAGS: The sampling interval at a surfel is given by $T_\text{N} =d/(f\cos\theta)$.
Figure 4: Illustration of guided depth diffusion. The colored patches indicate that depth is computed in latent space.
Figure 5: Baseline comparison. The inset is the input image. We use a fixed panoramic camera path for evaluation.
...and 10 more figures
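
The scale-initialization formula in the Figure 3 caption, $T_\text{N} = d/(f\cos\theta)$, can be tried out numerically. The sketch below is only an illustration and not the authors' code: it assumes that $d$ is the surfel's distance from the camera, $f$ is the focal length in pixels, and $\theta$ is the angle between the surfel normal and the viewing ray. The caption does not state these meanings, and the function name `sampling_interval` and the grazing-angle cutoff are hypothetical.

```python
import math


def sampling_interval(depth: float, focal_px: float, theta_rad: float) -> float:
    """Evaluate T_N = d / (f * cos(theta)) from the Figure 3 caption.

    Assumed meanings (not given in the caption): depth is the surfel's
    distance d, focal_px is the focal length f in pixels, and theta_rad
    is the angle between the surfel normal and the viewing ray.
    """
    cos_theta = math.cos(theta_rad)
    # Hypothetical guard: the formula diverges at grazing angles.
    if cos_theta <= 1e-6:
        raise ValueError("surfel is (nearly) edge-on or facing away")
    return depth / (focal_px * cos_theta)


# A surfel facing the camera versus one tilted by 60 degrees,
# both at distance 2 with a 500-pixel focal length.
frontal = sampling_interval(depth=2.0, focal_px=500.0, theta_rad=0.0)
slanted = sampling_interval(depth=2.0, focal_px=500.0, theta_rad=math.radians(60.0))
print(frontal, slanted)
```

Under these assumptions the interval grows linearly with distance and with $1/\cos\theta$, so the tilted surfel is assigned roughly twice the interval of the frontal one.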
