Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE

Yiying Yang, Fukun Yin, Jiayuan Fan, Xin Chen, Wanzhang Li, Gang Yu

TL;DR

Scene123 tackles the challenge of generating realistic and view-consistent 3D scenes from a single prompt by integrating a Consistency-Enhanced MAE for multi-view completion with a NeRF-based 3D representation. A discrete codebook with cross-attention seeds a globally coherent scene database, while depth alignment ensures consistency between rendered and estimated depths. The pipeline is further strengthened by a video-assisted 3D-aware refinement employing a GAN-like discriminator and a Stable Video Diffusion–generated support set to enhance textures and geometry. Quantitative and qualitative evaluations against state-of-the-art baselines show superior view consistency, texture fidelity, and semantic alignment with the input, supported by a user study. Overall, Scene123 demonstrates that combining MAE-based completion, robust 3D representations, and video priors enables high-quality single-prompt 3D scene generation with practical implications for AIGC content creation.
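The depth alignment mentioned above can be illustrated with a small sketch. The summary does not give the paper's exact formulation, so this assumes a common setup: a monocular depth estimate that is only defined up to an unknown scale and shift is fitted to the rendered depth by least squares, and the remaining difference is penalized. The function names `align_depth` and `depth_l1_loss` are hypothetical, and depths are passed as flat lists for simplicity.

```python
def align_depth(estimated, rendered):
    """Return (scale, shift) minimizing sum((scale * e + shift - r) ** 2).

    Assumption: the estimated depth is related to the rendered depth by an
    affine map, as is typical for monocular depth estimators.
    """
    n = len(estimated)
    if n != len(rendered) or n < 2:
        raise ValueError("need two equally sized depth lists with >= 2 values")
    mean_e = sum(estimated) / n
    mean_r = sum(rendered) / n
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    if var_e == 0.0:
        raise ValueError("estimated depth is constant; scale is undetermined")
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimated, rendered))
    scale = cov / var_e               # closed-form least-squares slope
    shift = mean_r - scale * mean_e   # closed-form least-squares intercept
    return scale, shift


def depth_l1_loss(estimated, rendered):
    """Mean absolute difference after scale-and-shift alignment."""
    scale, shift = align_depth(estimated, rendered)
    residuals = (abs(scale * e + shift - r) for e, r in zip(estimated, rendered))
    return sum(residuals) / len(estimated)
```

For example, `align_depth([1.0, 2.0, 3.0, 4.0], [2.5, 4.5, 6.5, 8.5])` recovers a scale of 2 and a shift of 0.5, so the aligned loss is zero. Whether Scene123 aligns the estimate to the rendering or the reverse is not stated here; the sketch picks one direction for illustration.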

Abstract

As Artificial Intelligence Generated Content (AIGC) advances, a variety of methods have been developed to generate text, images, videos, and 3D objects from single or multimodal inputs, contributing to efforts to emulate human-like cognitive content creation. However, generating realistic large-scale scenes from a single input remains challenging because it is difficult to ensure consistency across the extrapolated views that models generate. Benefiting from recent video generation models and implicit neural representations, we propose Scene123, a 3D scene generation model that not only ensures realism and diversity through the video generation framework but also uses implicit neural fields combined with Masked Autoencoders (MAE) to effectively ensure the consistency of unseen areas across views. Specifically, we first warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model. However, these filled images usually fail to maintain view consistency; we therefore use the produced views to optimize a neural radiance field, enhancing geometric consistency. Moreover, to further enhance the details and texture fidelity of generated views, we employ a GAN-based loss against images derived from the input image through the video generation model. Extensive experiments demonstrate that our method can generate realistic and consistent scenes from a single prompt. Both qualitative and quantitative results indicate that our approach surpasses existing state-of-the-art methods. We show encouraging video examples at https://yiyingyang12.github.io/Scene123.github.io/.
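The warping step described in the abstract can be sketched as a depth-based forward warp that also reports which target pixels received no source pixel; those are the "invisible areas" an inpainting model would have to fill. This is a minimal illustration under strong assumptions that are not taken from the paper: a pinhole camera, a pure horizontal translation of the camera by `baseline`, a single-channel image stored as nested lists, and nearest-pixel splatting. The function name `forward_warp` is hypothetical.

```python
def forward_warp(image, depth, focal, baseline):
    """Warp `image` to a camera shifted right by `baseline`.

    Returns (warped, hole_mask). hole_mask[y][x] is True where no source
    pixel landed, i.e. where content must be filled in afterwards.
    """
    h, w = len(image), len(image[0])
    warped = [[0.0] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            z = depth[y][x]
            # Pinhole disparity for a horizontal translation: d = f * b / z.
            # Moving the camera right shifts scene content left in the image.
            x_new = int(round(x - focal * baseline / z))
            # Z-buffer test: keep the nearest surface when pixels collide.
            if 0 <= x_new < w and z < zbuf[y][x_new]:
                zbuf[y][x_new] = z
                warped[y][x_new] = image[y][x]
    hole_mask = [[zbuf[y][x] == float("inf") for x in range(w)]
                 for y in range(h)]
    return warped, hole_mask
```

With a one-row image `[[10, 20, 30, 40]]`, constant depth 2, `focal=2` and `baseline=1`, every pixel moves one column to the left, the leftmost pixel leaves the frame, and the rightmost column becomes a hole. A full pipeline would use a general camera pose and pass the hole mask to the completion model.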

Paper Structure

This paper contains 18 sections, 8 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Some examples generated by our Scene123. For a single input image or text, our method can generate 3D scenes with consistent views, fine geometry, and realistic textures, applicable to real, virtual, or object-centered scenes.
  • Figure 2: Scene123's pipeline includes two key modules: the consistency-enhanced MAE and the 3D-aware generative refinement module. The former generates adjacent views from an input image via warping, using the MAE model to inpaint unseen areas with global semantics and optimizing an implicit neural field for viewpoint consistency. The latter generates realistic videos from the input image with a pre-trained video generation model, enhancing realism through adversarial loss with rendered images.
  • Figure 3: Qualitative results (zoom in for a better view) of methods capable of processing a single image prompt. We visualize both the texture and depth from novel views within the scene.
  • Figure 4: Data samples generated via the image-to-video model.
  • Figure 5: Qualitative results (zoom in for a better view) of methods that generate scenes from textual input. We visualize both the texture and depth from novel views within the scene.
  • ...and 8 more figures