WonderVerse: Extendable 3D Scene Generation with Video Generative Models

Hao Feng, Zhi Zuo, Jia-Hui Pan, Ka-Hei Hui, Yihua Shao, Qi Dou, Wei Xie, Zhengzhe Liu

TL;DR

WonderVerse tackles extendable 3D scene generation by shifting from image-based iterative pipelines to a video-driven approach that exploits world-level priors in video foundation models. It introduces a video generation and extension workflow, a COLMAP-based camera parameter estimation step, an abnormal sequence detection mechanism to ensure geometric coherence, and a 3D reconstruction/rendering stage that supports both efficient and high-quality backends. The approach achieves state-of-the-art extendable 3D scenes with improved semantic alignment and perceptual quality on indoor and outdoor scenes, while enabling controllable scene extension and interactive generation. The work demonstrates practical impact by enabling scalable, coherent, and realistic 3D environments from text prompts, suitable for VR/AR, robotics, and simulation, with a simple, modular pipeline.

Abstract

We introduce WonderVerse, a simple but effective framework for generating extendable 3D scenes. Unlike existing methods that rely on iterative depth estimation and image inpainting, often leading to geometric distortions and inconsistencies, WonderVerse leverages the powerful world-level priors embedded within video generative foundation models to create highly immersive and geometrically coherent 3D environments. Furthermore, we propose a new technique for controllable 3D scene extension to substantially increase the scale of the generated environments. We also introduce a novel abnormal sequence detection module that uses the camera trajectory to address geometric inconsistency in the generated videos. Finally, WonderVerse is compatible with various 3D reconstruction methods, allowing both efficient and high-quality generation. Extensive experiments on 3D scene generation demonstrate that WonderVerse, with an elegant and simple pipeline, delivers extendable and highly realistic 3D scenes, markedly outperforming existing works that rely on more complex architectures.

Paper Structure

This paper contains 23 sections, 3 equations, 18 figures, 2 tables, 1 algorithm.

Figures (18)

  • Figure 1: WonderVerse is able to create large-scale, coherent, extendable, and high-quality 3D scenes from a text prompt.
  • Figure 2: Illustration of our WonderVerse. This framework includes: (a) a text-guided video generation and extension module that produces a video of a scene circularly in a continuous shot, followed by extensions to both sides; (b) a camera parameter estimation module that predicts the camera pose sequence; (c) an abnormal sequence detection module that identifies discontinuous camera poses and regenerates the corresponding videos; and (d) a 3D scene reconstruction and rendering module to construct the generated scene.
  • Figure 3: WonderVerse generates large-scale, extendable, coherent, and high-fidelity 3D scenes, both indoors and outdoors. Dashed lines show the camera’s direction during scene extension.
  • Figure 4: Qualitative comparison with existing works.
  • Figure 5: Generated 3D scene without and with our abnormal sequence detection module.
  • ...and 13 more figures
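
The Figure 2 caption describes an abnormal sequence detection module that identifies discontinuous camera poses so the corresponding videos can be regenerated. The summary above does not state the exact criterion, so the following is only a minimal sketch of one plausible check, not the authors' method. It assumes camera centers have already been estimated (for example by COLMAP) and flags any frame whose translation step is much larger than the median step. The function name `find_abnormal_steps` and the `ratio` threshold are hypothetical, chosen purely for illustration.

```python
import math
from statistics import median


def find_abnormal_steps(positions, ratio=5.0):
    """Return frame indices i where the camera jump from frame i-1 to i is abnormal.

    positions: list of (x, y, z) camera centers, one per frame (assumed input).
    ratio:     hypothetical threshold; a step is flagged when it exceeds
               `ratio` times the median step length of the trajectory.
    """
    # Euclidean distance between each pair of consecutive camera centers.
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    if not steps:
        return []

    typical = median(steps)
    if typical == 0:
        # Mostly static camera: any motion at all counts as a discontinuity.
        return [i + 1 for i, s in enumerate(steps) if s > 0]

    return [i + 1 for i, s in enumerate(steps) if s > ratio * typical]


# A smooth pan along the x-axis with one sudden jump before frame 4.
trajectory = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.2, 0.0, 0.0),
    (0.3, 0.0, 0.0),
    (3.0, 0.0, 0.0),  # discontinuous pose
    (3.1, 0.0, 0.0),
    (3.2, 0.0, 0.0),
]
flagged = find_abnormal_steps(trajectory)
print(flagged)
```

In a pipeline like the one in Figure 2, a non-empty result would be the signal to regenerate the affected video segment. A fuller check would likely compare rotations as well as translations; this sketch looks at camera position only.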