latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction

Christopher Wewer, Kevin Raj, Eddy Ilg, Bernt Schiele, Jan Eric Lenssen

TL;DR

latentSplat addresses scalable, generalizable 3D reconstruction from two views by encoding a scene as $N$ variational 3D Gaussians with parameters $(\mathbf{x}_i, \mathbf{S}_i, \mathbf{R}_i, o_i, \mathbf{c}_i, \mathbf{h}_{\mu,i}, \mathbf{h}_{\sigma,i})$, which are sampled via the reparameterization trick and decoded by a lightweight 2D VAE-GAN decoder. The method combines regression-guided priors with a probabilistic 3D latent representation and uses an efficient Gaussian splatting renderer to enable fast, high-resolution novel-view synthesis with uncertainty modeling. It is trained end-to-end on real video data with a combination of reconstruction, auxiliary, and GAN losses. On CO3D and RealEstate10k it demonstrates state-of-the-art quality for two-view reconstruction and strong extrapolation capabilities, while maintaining real-time rendering and lower memory requirements than prior generative approaches. The result is 3D-consistent novel views that support downstream mesh reconstruction and scale to large scenes and resolutions.
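The sampling step mentioned above can be illustrated with a minimal sketch of the reparameterization trick for a single Gaussian's latent feature. It assumes that $\mathbf{h}_{\sigma,i}$ is a per-dimension standard deviation and that only the feature vector is sampled; the `VariationalFeature` container and `sample_feature` helper are hypothetical names for illustration, not the authors' code.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class VariationalFeature:
    """Hypothetical container for one Gaussian's latent feature distribution."""
    h_mu: List[float]     # mean of the feature vector
    h_sigma: List[float]  # assumed: per-dimension standard deviation (>= 0)


def sample_feature(g: VariationalFeature, rng: random.Random) -> List[float]:
    """Reparameterization trick: h = h_mu + h_sigma * eps, with eps ~ N(0, 1).

    The randomness lives entirely in eps, so h stays a deterministic
    function of h_mu and h_sigma.
    """
    return [
        mu + sigma * rng.gauss(0.0, 1.0)
        for mu, sigma in zip(g.h_mu, g.h_sigma)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
g = VariationalFeature(h_mu=[0.5, -1.0, 2.0], h_sigma=[0.0, 0.1, 1.0])
h = sample_feature(g, rng)
```

A dimension with zero standard deviation always returns its mean, while larger values of `h_sigma` spread the samples further; this is one way such a representation can express per-Gaussian uncertainty.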

Abstract

We present latentSplat, a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture. Existing methods for generalizable 3D reconstruction either do not scale to large scenes and resolutions, or are limited to interpolation of close input views. latentSplat combines the strengths of regression-based and generative approaches while being trained purely on readily available real video data. The core of our method are variational 3D Gaussians, a representation that efficiently encodes varying uncertainty within a latent space consisting of 3D feature Gaussians. From these Gaussians, specific instances can be sampled and rendered via efficient splatting and a fast, generative decoder. We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.

Paper Structure

This paper contains 25 sections, 6 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: We present latentSplat, a method for scalable generalizable 3D reconstruction from two reference views (left). We autoencode the views into a 3D latent representation consisting of variational feature Gaussians. From this representation, we can perform fast novel view synthesis (right), generalizing to interpolated and extrapolated views.
  • Figure 1: Comparison with GeNVS [chan2023genvs]. We qualitatively compare against GeNVS. Note that the setups of the two methods differ strongly and that the GeNVS code is not available for reproducing results, so this comparison is not fair. We selected the same CO3D test examples and compared against the renderings shown in their paper. Both methods generate similar quality, while ours is much faster (cf. Sec. 4.6 of the main paper).
  • Figure 2: latentSplat architecture. The architecture follows an autoencoder structure. (Left) Two input views are encoded into a 3D variational Gaussian representation using an epipolar transformer and a Gaussian sampling head. (Center) Variational Gaussians allow sampling of spherical harmonics feature coefficients that determine a specific instance of semantic Gaussians. (Right) The sampled instance can be rendered efficiently via Gaussian splatting and a light-weight VAE-GAN decoder.
  • Figure 2: Intermediate results for 360° novel view synthesis on CO3D hydrants [co3d]. For uncertainty on the right, darker regions correspond to higher uncertainty. Features are visualized with PCA dimensionality reduction to 3 dimensions.
  • Figure 3: Qualitative results on the CO3D dataset [co3d]. We evaluate two-view NVS on hydrants and teddybears. latentSplat synthesizes high-quality 360° novel views, whereas regression-based approaches suffer from uncertainty resulting in blur.
  • ...and 9 more figures