Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models

Paul Henderson, Melonie de Almeida, Daniela Ivanova, Titas Anciukevičius

TL;DR

This work tackles learning a distribution over real-world 3D scenes from posed multi-view images, without explicit 3D supervision. It proposes a two-stage latent diffusion model: an autoencoder maps multi-view inputs to a compact latent per view that decodes into a 3D scene represented by Gaussian splats, and a diffusion model operates in this latent space to enable fast, probabilistic generation and reconstruction conditioned on images or class labels. The method generates a scene in as little as 0.2 seconds and reaches competitive quality on large, in-the-wild datasets (MVImgNet and RealEstate10K), outperforming several 3D-aware baselines. Because it samples from a learned distribution over 3D scenes rather than predicting a single deterministic reconstruction, and avoids heavy per-scene optimization, it enables diverse, controllable, and fast 3D content synthesis from 2D data with no depth or mask supervision.

Abstract

We present a latent diffusion model over 3D scenes that can be trained using only 2D image data. To achieve this, we first design an autoencoder that maps multi-view images to 3D Gaussian splats, and simultaneously builds a compressed latent representation of these splats. Then, we train a multi-view diffusion model over the latent space to learn an efficient generative model. This pipeline requires neither object masks nor depths, and is suitable for complex scenes with arbitrary camera positions. We conduct careful experiments on two large-scale datasets of complex real-world scenes: MVImgNet and RealEstate10K. We show that our approach enables generating 3D scenes in as little as 0.2 seconds, either from scratch, from a single input view, or from sparse input views. It produces diverse and high-quality results while running an order of magnitude faster than non-latent diffusion models and earlier NeRF-based generative models.
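The data flow described in the abstract (multi-view images, to compressed per-view latents, to Gaussian splat parameters) can be illustrated with a shape-level sketch. This is not the paper's architecture: the random linear maps `W_enc` and `W_dec`, the downsampling factor `F`, the latent channel count `C`, the 14-value splat layout, and the one-Gaussian-per-latent-pixel choice are all hypothetical placeholders. The toy encoder also processes each view independently, whereas the paper's autoencoder is multi-view.

```python
import numpy as np

rng = np.random.default_rng(0)

V, H, W = 6, 64, 64            # number of views and image size (illustrative)
F, C = 8, 4                    # hypothetical downsampling factor, latent channels
SPLAT_DIM = 3 + 3 + 4 + 1 + 3  # mean, scale, rotation quaternion, opacity, RGB

# Random linear maps stand in for the learned encoder E and decoder D.
W_enc = rng.normal(size=(F * F * 3, C)) / np.sqrt(F * F * 3)
W_dec = rng.normal(size=(C, SPLAT_DIM)) / np.sqrt(C)

def encode(images):
    """Toy E: (V, H, W, 3) images -> (V, H/F, W/F, C) per-view latents."""
    v, h, w, _ = images.shape
    # Cut each image into non-overlapping F x F patches and flatten each patch.
    patches = images.reshape(v, h // F, F, w // F, F, 3)
    patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(v, h // F, w // F, -1)
    return patches @ W_enc

def decode(latents):
    """Toy D: per-view latents -> flat (N, SPLAT_DIM) array of splat parameters."""
    return (latents @ W_dec).reshape(-1, SPLAT_DIM)

images = rng.uniform(size=(V, H, W, 3))
latents = encode(images)
splats = decode(latents)
print(latents.shape, splats.shape)  # (6, 8, 8, 4) (384, 14)
```

The point of the sketch is only that the diffusion model sees `latents`, which here hold 48 times fewer values than `images`, rather than pixels or splats directly.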

Paper Structure

This paper contains 34 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Overview of our latent diffusion model for 3D scene synthesis. Left: We train an autoencoder that encodes (green box; $E$) multi-view images $\{x_v\}_1^V$ to a compressed latent space $\{z_v\}_1^V$. It simultaneously learns to decode (blue box; $D$) the latents to parameters of Gaussian splats $\mathcal{S}$, which can then be rendered back to images $x^*$. Right: We train a denoising diffusion model (pink box; $\bm{v}_\theta$) over the multi-view latent features $z_v$. This supports unconditional generation, or generation conditioned on an input image $x_\mathrm{cond}$ (itself encoded with $E$). Following the efficient, low-dimensional denoising process, the resulting latents are mapped back to a 3D scene by $D$.
  • Figure 2: Qualitative examples of class-conditional (MVImgNet) and unconditional 3D generations (RealEstate10K) from our method. For each example, the top row shows six rendered views of the sampled 3D scene, while the bottom row shows the corresponding depths. Note that our model samples 3D scenes containing objects with complex shape on a realistic background.
  • Figure 3: Qualitative comparison of 3D reconstruction from a single image between our model (top row of each scene) and SplatterImage (Szymanowicz et al., 2024) (bottom row of each scene) on MVImgNet (a) and RealEstate10K (b). The first column shows the input (conditioning) image, the second displays the ground truth images, while the third and fourth columns display the predicted frames and depths, respectively. Compared to the baseline, our model yields more plausible reconstructions, especially of the occluded regions and the background. Additional examples in the Appendix.
  • Figure 4: Given a single input image from MVImgNet (first column), our model performs 3D reconstruction in a generative manner, and can therefore produce multiple diverse back-views (columns 2 through 7). When compared with the back-view generated by a deterministic model (column 8), our model's predictions are much sharper. Additional examples in the Appendix.
  • Figure 5: Given a single input image from RealEstate10K (first column), our model generates diverse possible completions of parts of the house that are not initially visible. In the top row, the camera moves into the doorway to the left; in the bottom row, the camera moves along the hallway. In both cases, our model generates diverse samples for the room that is revealed.
  • ...and 4 more figures
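
The denoiser in Figure 1 is written $\bm{v}_\theta$, which suggests the v-prediction parameterization, although the captions above do not state this. Assuming v-prediction, a cosine noise schedule, and a deterministic DDIM sampler (all assumptions made here for illustration), the low-dimensional denoising loop over multi-view latents might be sketched as follows. `toy_v_model` is a hypothetical stand-in for the trained network: it returns the exact $v$ for a data distribution concentrated on one known latent, so the sampler's output can be checked.

```python
import numpy as np

def alpha_sigma(t):
    """Cosine schedule on t in [0, 1]; satisfies alpha**2 + sigma**2 == 1."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def toy_v_model(z_t, t, target):
    """Stand-in for v_theta: the exact v when all data equals `target`."""
    alpha, sigma = alpha_sigma(t)
    eps = (z_t - alpha * target) / sigma   # noise implied by z_t (requires t > 0)
    return alpha * eps - sigma * target    # v = alpha * eps - sigma * x0

def ddim_sample(v_model, shape, num_steps, rng):
    """Deterministic DDIM sampling from pure noise (t = 1) down to t = 0."""
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    z = rng.normal(size=shape)
    for t, s in zip(ts[:-1], ts[1:]):
        alpha_t, sigma_t = alpha_sigma(t)
        alpha_s, sigma_s = alpha_sigma(s)
        v = v_model(z, t)
        x0_hat = alpha_t * z - sigma_t * v    # predicted clean latent
        eps_hat = sigma_t * z + alpha_t * v   # predicted noise
        z = alpha_s * x0_hat + sigma_s * eps_hat
    return z

rng = np.random.default_rng(0)
target = rng.normal(size=(6, 8, 8, 4))  # hypothetical clean multi-view latents
sample = ddim_sample(lambda z, t: toy_v_model(z, t, target), target.shape, 10, rng)
print(np.allclose(sample, target))
```

With a real denoiser, the returned latents would then be passed to the decoder $D$ to obtain Gaussian splats; only the few-step loop over small latent tensors is illustrated here.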