Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

Felix Wimbauer, Fabian Manhardt, Michael Oechsle, Nikolai Kalischek, Christian Rupprecht, Daniel Cremers, Federico Tombari

Abstract

The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.

Paper Structure

This paper contains 47 sections, 3 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Stepper sets a new state of the art in the quality of generated explorable 3D scenes. Its core innovation is a novel cubemap-based multi-view panorama diffusion model that enables high-resolution scene synthesis while facilitating stepwise, coherent scene expansion and high-quality scene reconstruction. Please check out our project page at: https://fwmb.github.io/stepper
  • Figure 2: Method overview. a) Our model generates a new panoramic image from a previously unobserved viewpoint based on a given input panorama. To ensure high quality, we utilize a pre-trained diffusion model with expanded multi-view attention that jointly denoises the high-resolution cube faces of the newly generated novel-view panorama. b) Our ability to generate novel-view panoramas enables autoregressive scene generation in all directions, yielding a set of high-quality, consistent panoramas that effectively complete the representation of the 3D scene. c) The generated panoramas are processed with a feed-forward reconstruction model, MapAnything. The output point cloud serves as the initialization of a custom 3D Gaussian Splatting reconstruction, enabling high-quality novel view synthesis of the generated 3D scene.
  • Figure 3: Dataset Samples. The dataset generated with Infinigen consists of a diverse set of high-quality synthetic panoramas of indoor and outdoor scenes. For every panorama, we rendered a paired panorama from a novel viewpoint, enabling the training of the multi-view panorama generation model. All panoramas are aligned to the horizon.
  • Figure 4: 3D Scene Generation. We provide visual examples of generated novel-view panoramas on the left side. The details of the initial panorama are well preserved, and previously unseen regions are filled in plausibly. On the right side, we show novel-view renderings of the reconstructed scenes, indicating the 3D consistency of the generated panoramas.
  • Figure 5: Comparison with Baselines. Given a high-quality input panorama, our approach achieves consistent scene generation while showing significantly more detail and sharpness in the rendered novel-view images than the baselines.
  • ...and 7 more figures
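The pipeline described in the Figure 2 caption (novel-view panorama generation, autoregressive expansion, then reconstruction) can be sketched in outline form. This is a minimal illustrative sketch, not the authors' implementation: all class and function names (`Panorama`, `generate_novel_view`, `expand_scene`) are hypothetical, and the diffusion model and MapAnything/3DGS reconstruction stages are stubbed out.

```python
# Hypothetical sketch of Stepper's stepwise scene expansion (Fig. 2).
# All names are illustrative placeholders; the actual model, conditioning,
# and reconstruction code are not part of this sketch.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Panorama:
    """A 360° panorama stored as six cube faces (cubemap), plus the
    camera position of its viewpoint."""
    faces: List[object]            # six face images (placeholders here)
    position: Tuple[float, float, float]

def generate_novel_view(pano: Panorama,
                        new_position: Tuple[float, float, float]) -> Panorama:
    """Stub for the multi-view 360° diffusion model: it would jointly
    denoise the six high-resolution cube faces of the panorama at
    `new_position`, conditioned on the input panorama."""
    return Panorama(faces=list(pano.faces), position=new_position)

def expand_scene(seed: Panorama,
                 waypoints: List[Tuple[float, float, float]]) -> List[Panorama]:
    """Autoregressive expansion: each new viewpoint is generated from the
    most recently generated panorama, growing the scene step by step."""
    panoramas = [seed]
    for wp in waypoints:
        panoramas.append(generate_novel_view(panoramas[-1], wp))
    return panoramas

# Downstream (not sketched): the panorama set is passed to a feed-forward
# reconstructor (MapAnything); its point cloud initializes a 3D Gaussian
# Splatting optimization for novel-view rendering.
```

The key design point visible even in this sketch is that expansion is stepwise: each panorama conditions the next, rather than generating all views in one low-resolution pass.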