DSplats: 3D Generation by Denoising Splats-Based Multiview Diffusion Models

Kevin Miao, Harsh Agrawal, Qihang Zhang, Federico Semeraro, Marco Cavallo, Jiatao Gu, Alexander Toshev

TL;DR

DSplats addresses single-image to 3D reconstruction by fusing a pretrained 2D latent diffusion prior with an explicit 3D Gaussian splat representation. It formulates diffusion end-to-end in latent space: a 3D-aware denoiser predicts Gaussians and renders views from them, which enforces cross-view consistency and enables novel view synthesis from a single input. On Google Scanned Objects, DSplats reports a PSNR of 20.38, an SSIM of 0.842, and an LPIPS of 0.109, which the authors describe as a new standard for single-image to 3D reconstruction, while maintaining geometric coherence across views. By removing the need for per-view optimization, this approach offers a scalable pathway to high-fidelity 3D content from sparse inputs.

Abstract

Generating high-quality 3D content requires models capable of learning robust distributions of complex scenes and the real-world objects within them. Recent Gaussian-based 3D reconstruction techniques have achieved impressive results in recovering high-fidelity 3D assets from sparse input images by predicting 3D Gaussians in a feed-forward manner. However, these techniques often lack the extensive priors and expressiveness offered by Diffusion Models. On the other hand, 2D Diffusion Models, which have been successfully applied to denoise multiview images, show potential for generating a wide range of photorealistic 3D outputs but still fall short on explicit 3D priors and consistency. In this work, we aim to bridge these two approaches by introducing DSplats, a novel method that directly denoises multiview images using Gaussian Splat-based Reconstructors to produce a diverse array of realistic 3D assets. To harness the extensive priors of 2D Diffusion Models, we incorporate a pretrained Latent Diffusion Model into the reconstructor backbone to predict a set of 3D Gaussians. Additionally, the explicit 3D representation embedded in the denoising network provides a strong inductive bias, ensuring geometrically consistent novel view generation. Our qualitative and quantitative experiments demonstrate that DSplats not only produces high-quality, spatially consistent outputs, but also sets a new standard in single-image to 3D reconstruction. When evaluated on the Google Scanned Objects dataset, DSplats achieves a PSNR of 20.38, an SSIM of 0.842, and an LPIPS of 0.109.
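
To make the reported PSNR figure easier to interpret, here is a minimal sketch of how PSNR is commonly computed between a rendered view and its ground-truth image. The function name `psnr`, the `max_value` parameter, and the assumption that images are scaled to [0, 1] are illustrative choices, not details taken from the paper, and the paper's exact evaluation protocol (resolution, backgrounds, averaging over views) is not specified here.

```python
import numpy as np


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    Assumes pixel values lie in [0, max_value]; this scaling is an
    assumption of the sketch, not something stated in the source.
    """
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    if reference.shape != rendered.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# A uniform per-pixel error of 0.1 on a [0, 1] scale gives roughly 20 dB.
ground_truth = np.zeros((3, 8, 8))
render = np.full((3, 8, 8), 0.1)
example_db = psnr(ground_truth, render)
```

Under this definition, a PSNR near 20 dB corresponds to a root-mean-square pixel error of about 0.1 on a [0, 1] scale. Higher is better for PSNR and SSIM, while lower is better for LPIPS.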

Paper Structure

This paper contains 14 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: By leveraging the 2D diffusion prior of Latent Diffusion Models and an explicit 3D Gaussian representation, DSplats generates photorealistic 3D objects, including objects in the wild, from only a single input image. These objects can then be rendered from any novel view.
  • Figure 2: Qualitative results: given a single input image of a real-world object, DSplats generates a high-quality, realistic 3D representation of it.
  • Figure 3: DSplats: single end-to-end training of an image-pretrained, 3D-aware diffusion model. During training, we pass the multiview input $X$ through our encoder to obtain latents. Gaussian noise is added to these latents, and the noisy latents are concatenated channel-wise with the camera ray maps before being fed into the U-Net. The decoder outputs multiview 3D Gaussians, which are used to render the input views as well as unseen views. The output renders are used to train the reconstruction model with $L_{render}$. From the denoised output renders, we select the clean multiview images and pass them through our encoder to obtain denoised latents, which are used to train with $L_{diffusion}$.
  • Figure 4: Qualitative comparison of our results on Google Scanned Objects [downs2022google] with One-2-3-45 [liu2024one] and GRM [xu2024grm]. Given a single input image (top row), we render four novel views for each method. For One-2-3-45, we were unable to match the pose of the multiview images exactly, so we display the closest available view. These results show that DSplats produces photorealistic outputs in terms of lighting and texture and exhibits a strong geometric prior.
  • Figure 5: DSplats can be extended to real-world images, as shown by these in-the-wild objects (left) and the corresponding generated novel views (right).
  • ...and 2 more figures
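
The training data flow described in the Figure 3 caption can be sketched at the level of array shapes as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the DDPM-style noising rule, the use of plain mean squared error for both $L_{render}$ and $L_{diffusion}$, the comparison of denoised latents against the clean latents, the 6-channel ray maps, and the names `encode` and `predict_and_render` (stand-ins for the encoder and for the U-Net, Gaussian decoder, and splat renderer combined) are all hypothetical.

```python
import numpy as np


def add_noise(latents, alpha_bar_t, rng):
    """Assumed DDPM-style forward noising; the paper's schedule is not given here."""
    noise = rng.standard_normal(latents.shape)
    return np.sqrt(alpha_bar_t) * latents + np.sqrt(1.0 - alpha_bar_t) * noise


def concat_ray_maps(noisy_latents, ray_maps):
    """Channel-wise concatenation: (V, C, h, w) and (V, R, h, w) -> (V, C + R, h, w)."""
    return np.concatenate([noisy_latents, ray_maps], axis=1)


def mse(a, b):
    return float(np.mean((a - b) ** 2))


def training_losses(images, target_views, ray_maps, encode,
                    predict_and_render, alpha_bar_t, rng):
    """Return (loss_render, loss_diffusion) for one batch of V input views.

    `predict_and_render` is a hypothetical stand-in that maps the U-Net input
    to rendered images, with the V input views first and unseen views after.
    """
    latents = encode(images)                          # clean latents
    noisy = add_noise(latents, alpha_bar_t, rng)      # noisy latents
    unet_input = concat_ray_maps(noisy, ray_maps)     # latents + camera rays
    renders = predict_and_render(unet_input)          # (V + U, 3, H, W)
    loss_render = mse(renders, target_views)          # assumed MSE form
    num_input_views = images.shape[0]
    denoised_latents = encode(renders[:num_input_views])
    loss_diffusion = mse(denoised_latents, latents)   # assumed target
    return loss_render, loss_diffusion


# Toy usage with stand-in modules (all hypothetical).
rng = np.random.default_rng(0)
V, U, H, W = 4, 2, 16, 16
images = rng.random((V, 3, H, W))
unseen = rng.random((U, 3, H, W))
target_views = np.concatenate([images, unseen], axis=0)
ray_maps = rng.random((V, 6, H // 8, W // 8))


def encode(x):
    # Stand-in encoder: 8x8 average pooling instead of a learned VAE.
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 8, 8, w // 8, 8).mean(axis=(3, 5))


def predict_and_render(unet_input):
    # Oracle stand-in: returns the ground-truth views, so both losses are zero.
    return target_views.copy()


loss_render, loss_diffusion = training_losses(
    images, target_views, ray_maps, encode, predict_and_render,
    alpha_bar_t=0.5, rng=rng)
```

The point of the sketch is the ordering of operations: the rendering loss is computed in image space on all rendered views, while the diffusion loss is computed in latent space on only the re-encoded input views.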