DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation

Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu

TL;DR

DiffSplat addresses the challenge of scalable 3D content generation from text or a single image by repurposing web-scale 2D diffusion priors into 3D Gaussian splats. It introduces a three-part pipeline with scalable structured splat reconstruction for data curation, a splat-latent VAE, and a DiffSplat generator that jointly optimizes diffusion and rendering losses to ensure 3D coherence. The method supports text- and image-conditioned generation and is compatible with various pretrained diffusion models, enabling rapid adaptation of image-generation techniques to 3D. Empirical results demonstrate state-of-the-art alignment and 3D fidelity across text- and image-conditioned tasks with improved efficiency over prior 3D diffusion approaches.

Abstract

Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptations of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.
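
The abstract describes training with a regular diffusion loss on the splat grids together with a 3D rendering loss. A minimal sketch of how two such terms could be combined into one objective follows. It is an illustration only, not the authors' implementation: the function names, the `render_weight` parameter, and the use of plain mean squared error for both terms are assumptions, since this summary does not give the exact form or weighting of either loss.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length flat sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(
    pred_noise: Sequence[float],
    true_noise: Sequence[float],
    rendered: Sequence[float],
    target: Sequence[float],
    render_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """Sum of a denoising term and a weighted image-space rendering term.

    Assumption: both terms are plain MSE stand-ins. The paper's actual
    losses may use different forms and weights.
    """
    diffusion_term = mse(pred_noise, true_noise)  # computed on the splat grids
    rendering_term = mse(rendered, target)        # computed on rendered views
    return diffusion_term + render_weight * rendering_term


# Toy values: flattened noise prediction and flattened rendered pixels.
loss = combined_loss([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.5, 0.5], render_weight=2.0)
print(loss)
```

In a real training loop the rendering term would be computed on views rasterized from the decoded Gaussian splats. Here the inputs are flat lists so the sketch stays self-contained.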

Paper Structure

This paper contains 37 sections, 6 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Comparison with Previous 3D Diffusion Generative Models. (1) Native 3D methods and (2) rendering-based methods encounter challenges in training 3D diffusion models from scratch with limited 3D data. (3) Reconstruction-based methods struggle with inconsistencies in generated multi-view images. In contrast, (4) DiffSplat leverages pretrained image diffusion models for direct 3DGS generation, effectively utilizing 2D diffusion priors and maintaining 3D consistency. "GT" refers to ground-truth samples in a 3D representation used for diffusion loss computation.
  • Figure 2: Method Overview. (1) A lightweight reconstruction model provides a high-quality structured representation for "pseudo" dataset curation. (2) An image VAE is fine-tuned to encode Gaussian splat properties into a shared latent space. (3) DiffSplat natively generates 3D content from image and text conditions, utilizing 2D priors from text-to-image diffusion models.
  • Figure 3: Qualitative Results and Comparisons on Text-conditioned 3D Generation. More visualizations of DiffSplat results are provided in Appendix Figures \ref{fig:apx_t23d}, \ref{fig:apx_t23d_2} and \ref{fig:apx_t23d_3}.
  • Figure 4: Qualitative Results and Comparisons on Image-conditioned 3D Generation. More visualizations of DiffSplat results are provided in Appendix Figures \ref{fig:apx_i23d}, \ref{fig:apx_i23d_2} and \ref{fig:apx_i23d_3}.
  • Figure 5: Controllable Generation. ControlNet can be seamlessly adapted to DiffSplat for controllable text-to-3D generation with various control formats, such as normal maps, depth maps, and Canny edges.
  • ...and 9 more figures