Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

Katja Schwarz, Norman Mueller, Peter Kontschieder

TL;DR

This work tackles the challenge of generating photorealistic and 3D-consistent scenes from limited input views by integrating an explicit 3D Gaussian splat representation with a pre-trained latent video diffusion model. GGS predicts a 3D feature field via Gaussian splats from posed images and renders it into feature maps or a 3D radiance field, enabling direct 3D synthesis and improved multi-view consistency. Key contributions include a pose-conditioned diffusion framework, an epipolar transformer to link views, a 3D decoder and optional depth supervision, and an autoregressive scene synthesis capability that scales to multiple references. Empirical results on RealEstate10K and ScanNet++ show substantial gains in 3D consistency (TSED) and 3D scene fidelity (FID/FVD) over strong baselines, illustrating a practical path toward coherent, depth-aware 3D content generation from limited data.

Abstract

Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., they lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS), a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet++, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet++. Project page: https://katjaschwarz.github.io/ggs/

Paper Structure

This paper contains 26 sections, 13 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Overview: Given one or more input images, GGS leverages a video diffusion prior to directly generate a 3D radiance field parameterized via 3D Gaussian primitives. GGS first generates a feature field with a pose-conditional diffusion model and subsequently decodes the feature splats, yielding an explicit 3D representation of the generated scene. Project page: https://katjaschwarz.github.io/ggs/
  • Figure 2: Model Architecture: Our approach, GGS, directly synthesizes a 3D representation, which is parameterized by a set of Gaussian splats $\{\mathbf{g}^m\}$, from a set of posed input images. Specifically, during training we consider a set of posed images $\{\mathbf{I}^m\}$ with associated camera poses $\{\mathbf{p}^m\}$ and corresponding Plücker embeddings $\{\mathbf{P}^m\}$. The images are first encoded into a latent representation $\{\mathbf{z}_0^m\}$, which is then partitioned into $K$ reference images and $L$ target images. We introduce noise only to the latents of the target images $\{\mathbf{z}_{tgt,0}^l\}_{l=1}^L$, while leaving the reference images noise-free. To ensure compatibility with the pre-trained image-to-video diffusion model, we duplicate the reference latents across the channel dimension and concatenate zeros for the target latents. The resulting latents, along with the noise level $\sigma_t$ and Plücker embeddings, are fed into a U-Net architecture that produces intermediate per-latent feature maps. These feature maps are subsequently processed by an epipolar transformer $\mathcal{T}_{epi}$ to predict the parameters of the Gaussian feature splats $\{\mathbf{g}^m\}$. We render both feature maps $\{\mathbf{f}^m\}$ and low-resolution images $\{\mathbf{I}_{LR}^m\}$ for the input views, as well as low-resolution images for $J$ novel views $\{\mathbf{I}_{nv,LR}^j\}_{j=1}^J$ to regularize the 3D representation. Finally, the rendered feature maps are decoded into a weighted combination of sample noise $\mathbf{\xi}^m$ and input latent to predict the noise-free latents $\{\hat{\mathbf{z}}_0^m\}$.
  • Figure 3: Baseline Comparison Given One Reference Image: We show results for the strongest baselines CameraCtrl [He2004CVPR] and ViewCrafter [Yu2024ViewCrafter] together with our approach without (Ours-No3D) and with 3D representation (GGS). Best viewed zoomed in.
  • Figure 4: Baseline Comparison For View Extrapolation Given Two Reference Images: We show results for the strongest baselines LatentSplat [Wewer2024latentsplat] and ViewCrafter [Yu2024ViewCrafter] together with our approach without (Ours-No3D) and with 3D representation (GGS). As both reference views are close together, we only include one image for reference. Best viewed zoomed in.
  • Figure 5: 3D Reconstruction Results From Generated Images: We run an off-the-shelf 3DGS optimization on the generated multi-view images of ViewCrafter and GGS (Ours). For ViewCrafter, we use 15,000 optimization steps. For our approach, we only refine the generated splats with the generated multi-view images, using 5,000 iterations. The resulting 3D representation is shown on the left and two rendered views from novel viewpoints are included on the right.
  • ...and 6 more figures
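
The input preparation described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `plucker_embedding` and `build_unet_input` are hypothetical, and several details are assumptions that the caption does not specify: the Plücker channels are ordered as (moment, direction), rays pass through pixel centers, and the noise model is the EDM-style $\mathbf{z}_t = \mathbf{z}_0 + \sigma_t \boldsymbol{\epsilon}$.

```python
import numpy as np


def plucker_embedding(K, cam_to_world, height, width):
    """Per-pixel Plücker ray coordinates (o x d, d), returned as (6, H, W).

    K is a 3x3 pinhole intrinsics matrix and cam_to_world a 4x4 pose.
    Channel order and pixel-center sampling are assumptions of this sketch.
    """
    # Pixel-center coordinates; default 'xy' indexing yields (H, W) grids.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)       # (H, W, 3)
    dirs_cam = pix @ np.linalg.inv(K).T                     # back-project
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    d = dirs_cam @ R.T                                      # rotate to world
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)       # unit directions
    o = np.broadcast_to(t, d.shape)                         # ray origins
    moment = np.cross(o, d)                                 # o x d
    return np.concatenate([moment, d], axis=-1).transpose(2, 0, 1)


def build_unet_input(latents, is_reference, sigma, rng):
    """Assemble U-Net inputs from latents of shape (M, C, H, W).

    Reference latents stay noise-free and are duplicated along the channel
    axis; target latents are noised and padded with zeros. Output shape is
    (M, 2C, H, W). The additive noise model is an assumption.
    """
    noise = rng.standard_normal(latents.shape)
    noisy = latents + sigma * noise
    ref = is_reference[:, None, None, None]                 # broadcast mask
    first_half = np.where(ref, latents, noisy)              # refs stay clean
    second_half = np.where(ref, latents, 0.0)               # copy or zeros
    return np.concatenate([first_half, second_half], axis=1)
```

In this sketch, a boolean mask such as `is_reference = np.array([True, False, False, False, True])` marks $K = 2$ reference views and $L = 3$ target views. How the Plücker embeddings and the noise level $\sigma_t$ are injected into the U-Net is not covered here.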