Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation

Yuanbo Yang, Jiahao Shao, Xinyang Li, Yujun Shen, Andreas Geiger, Yiyi Liao

TL;DR

Prometheus addresses the challenge of fast, generalizable text-to-3D scene generation by leveraging large-scale 2D priors through a two-stage latent diffusion framework. It introduces a GS-VAE to learn pixel-aligned 3D Gaussians from RGB-D views and an MV-LDM to denoise multi-view latents conditioned on text and camera poses, producing scene-level 3D Gaussians in seconds. The RGB-D latent space disentangles appearance and geometry, improving fidelity and geometry while enabling efficient feed-forward generation. Across diverse datasets, Prometheus achieves strong reconstruction and 3D generation performance with notable speed advantages over prior baselines.

Abstract

In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon a pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation. Project page: https://freemty.github.io/project-prometheus/
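
To make "pixel-aligned" concrete, the following is a minimal sketch of the underlying geometry only, not the authors' implementation: each pixel of a depth map is back-projected along its camera ray to give one 3D center per pixel. The function name `unproject_depth`, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the z-depth convention, and the 4x4 camera-to-world matrix are all illustrative assumptions; the paper's decoder additionally predicts the remaining Gaussian attributes, which are not modeled here.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, ...]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Vec3]:
    """Return one world-space 3D point per pixel, in row-major order.

    Assumptions (hypothetical, for illustration): a pinhole camera,
    depth measured along the optical axis, and a 4x4 camera-to-world
    matrix given as nested lists.
    """
    centers: List[Vec3] = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            # Back-project pixel (u, v) with depth d into camera coordinates.
            x_cam = (u - cx) / fx * d
            y_cam = (v - cy) / fy * d
            p = (x_cam, y_cam, d, 1.0)
            # Apply the rigid camera-to-world transform (first three rows).
            world = tuple(
                sum(cam_to_world[r][c] * p[c] for c in range(4))
                for r in range(3)
            )
            centers.append(world)
    return centers


# Usage: a 2x2 depth map seen by a camera shifted by +1 along world x.
depth_map = [[2.0, 2.0], [2.0, 2.0]]
pose = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
centers = unproject_depth(
    depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5, cam_to_world=pose
)
```

Because every pixel yields exactly one center, the number of Gaussians per view equals the image resolution. This is why a feed-forward decoder can emit them as an image-shaped tensor.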

Paper Structure (16 sections, 17 equations, 10 figures, 5 tables)

This paper contains 16 sections, 17 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: We present Prometheus, a novel method for feed-forward scene-level 3D generation. At its core, our approach harnesses the power of 2D priors to fuel generalizable and efficient 3D synthesis -- hence our name, Prometheus.
  • Figure 2: Method Overview. Our training process is divided into two stages. In stage 1, we train a GS-VAE. Given multi-view images along with their corresponding pseudo depth maps and camera poses, the GS-VAE encodes these multi-view RGB-D images, integrates cross-view information, and decodes them into pixel-aligned 3DGS. In stage 2, we train an MV-LDM. With the trained MV-LDM, we generate multi-view RGB-D latents by denoising randomly sampled noise.
  • Figure 3: Qualitative comparison for Stage 1. We compare Prometheus against baselines under varying difficulty settings. As overlap gradually decreases, the advantages of our method continue to grow. Moreover, as shown in the depth map, our method exhibits superior geometry quality across all settings.
  • Figure 4: Qualitative comparison for Stage 2: Object-level 3D generation. Prometheus generates objects that align with the given description, incorporating rich background information and intricate details.
  • Figure 5: Qualitative comparison for Stage 2: Scene-level 3D generation with diverse scene-level prompts. Our results align better with the text prompts and capture more details.
  • ...and 5 more figures