
WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Weijie Wang, Haoyun Li, Guosheng Zhao, Jie Li, Wenkang Qin, Guan Huang, Wenjun Mei

TL;DR

WonderTurbo tackles the bottleneck of real-time interactive 3D generation from a single image by jointly accelerating geometry and appearance modeling. It introduces StepSplat for fast, depth-guided geometry via a memory-augmented cost volume and incremental fusion, QuickDepth for lightweight depth completion, and FastPaint for a 2-step diffusion-based appearance refinement. Together, these enable interactive scene updates in 0.72 seconds with up to a 15× speedup over prior methods while maintaining strong spatial consistency and rendering quality. The approach is validated against both offline and online baselines using CLIP-based metrics and user studies, demonstrating substantial gains in speed with competitive or superior quality and coherence, making real-time interactive 3D editing from a single image practically feasible.

Abstract

Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a 2-step diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a 15× speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.
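The timing figures quoted in the abstract can be related with some simple arithmetic. The sketch below uses only the three numbers stated above (0.72 s per interaction, 0.26 s per StepSplat update, 15× speedup). The variable names are illustrative, and two readings are assumptions rather than statements from the paper: that the 15× factor applies directly to the per-interaction latency, and that the time not spent in StepSplat is shared by the remaining stages and any overhead.

```python
# Numbers quoted in the abstract.
TOTAL_S = 0.72       # seconds per interaction
STEPSPLAT_S = 0.26   # seconds per StepSplat update
SPEEDUP = 15.0       # reported speedup over baseline methods

# Time left for everything other than StepSplat (assumed to cover
# QuickDepth, FastPaint, and any overhead; the split is not given here).
remaining_s = round(TOTAL_S - STEPSPLAT_S, 2)

# Baseline latency implied IF the 15x factor applies to per-interaction time
# (an assumption made for illustration).
implied_baseline_s = round(TOTAL_S * SPEEDUP, 2)

# Interactions per second at the reported latency.
interactions_per_second = round(1.0 / TOTAL_S, 2)

print(remaining_s, implied_baseline_s, interactions_per_second)
```

Under these assumptions, StepSplat accounts for a little over a third of each interaction, and a baseline would need on the order of ten seconds for the same step.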

Paper Structure

This paper contains 18 sections, 9 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Beginning with a single image, users can freely adjust the viewpoint and interactively control the generation of a 3D scene, each interaction requiring only 0.72 seconds.
  • Figure 2: The pipeline of WonderTurbo. As the user moves the real-time rendering camera and inputs the text, the rendered image and depth map are then processed by FastPaint and QuickDepth to generate coherent geometry and appearance. Finally, StepSplat performs incremental fusion based on the outputs of FastPaint and QuickDepth.
  • Figure 3: The structure of StepSplat.
  • Figure 4: The process of constructing the interactive 3D generation dataset.
  • Figure 5: Qualitative comparisons of using a fixed panoramic camera path.
  • ...and 3 more figures
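The data flow described in the Figure 2 caption can be sketched as a single interaction step. This is a minimal illustration, not the authors' implementation: `Scene`, `interaction_step`, and every callable signature are hypothetical, and feeding the inpainted image into the depth stage is an assumption, since the caption only says that the rendered image and depth map are processed by FastPaint and QuickDepth before StepSplat fuses the results.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Scene:
    """Hypothetical stand-in for the accumulated 3D representation."""
    fused_views: List[dict] = field(default_factory=list)


def interaction_step(
    scene: Scene,
    camera_pose: Any,
    prompt: str,
    render: Callable[[Scene, Any], Tuple[Any, Any]],
    fast_paint: Callable[[Any, str], Any],
    quick_depth: Callable[[Any, Any], Any],
    step_splat: Callable[[Scene, Any, Any, Any], Scene],
) -> Scene:
    """Run one user interaction through the four stages named in Figure 2."""
    # 1. Render the current scene from the user's new camera pose.
    image, depth = render(scene, camera_pose)
    # 2. Appearance: inpaint the rendered image, guided by the text prompt.
    painted = fast_paint(image, prompt)
    # 3. Geometry: complete the rendered depth map.
    #    (Passing the inpainted image here is an assumption.)
    completed_depth = quick_depth(painted, depth)
    # 4. Fusion: merge the new view into the existing scene incrementally.
    return step_splat(scene, painted, completed_depth, camera_pose)
```

Passing the stages in as callables keeps the sketch independent of any particular model; each stage can be replaced by a stub when checking the control flow.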