Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

Yuanhao Cai, He Zhang, Kai Zhang, Yixun Liang, Mengwei Ren, Fujun Luan, Qing Liu, Soo Ye Kim, Jianming Zhang, Zhifei Zhang, Yuqian Zhou, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille

TL;DR

DiffusionGS tackles single-view 3D generation and reconstruction by baking 3D Gaussian splatting into the diffusion denoiser, which yields view-consistent generation without a depth estimator. It introduces a scene-object mixed training strategy and a new camera conditioning method, Reference-Point Plücker Coordinate (RPPC), to learn a general 3D prior across objects and scenes. Reported PSNR/FID improvements over state-of-the-art methods are 2.20 dB/23.25 for objects and 1.34 dB/19.16 for scenes, with inference taking about 6s on an A100 GPU, a reported speedup of over 5×. The method handles prompt views from any direction and supports both object-level generation and scene-level reconstruction, including hard cases with occlusion.

Abstract

Existing feedforward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when the prompt view direction changes and mainly handle object-centric cases. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and to allow the model to generate robustly given prompt views from any direction, beyond object-centric inputs. In addition, to improve the capability and generality of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that DiffusionGS yields improvements of 2.20 dB/23.25 and 1.34 dB/19.16 in PSNR/FID for objects and scenes over state-of-the-art methods, without a depth estimator. Our method is also over 5$\times$ faster ($\sim$6s on an A100 GPU). Our project page at https://caiyuanhao1998.github.io/project/DiffusionGS/ shows the video and interactive results. The code and models are publicly available at https://github.com/caiyuanhao1998/Open-DiffusionGS

Paper Structure

This paper contains 11 sections, 13 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Single-view object generation (upper) and scene reconstruction (lower) results of our method. For single-view object generation, the prompt views are shown in the left dashed box. The generated novel views and 3D Gaussian point clouds are depicted on the right. For single-view scene reconstruction, our model can handle hard cases with occlusion and rotation, as illustrated in the dashed boxes of the third row. The prompt views of the object and scene text-to-3D demos are generated by Stable Diffusion and Sora, respectively.
  • Figure 2: Single-view object generation results of our method on GSO, wild images, and images generated from text by Stable Diffusion or FLUX. Our DiffusionGS can robustly handle hard cases with furry appearance, shadow, flat illustration, complex geometry, and specularity.
  • Figure 3: Single-view scene reconstruction results of our method on indoor and outdoor scenes. The depth maps are rendered from the GS point clouds.
  • Figure 4: Pipeline. (a) When selecting the data for our scene-object mixed training, we impose two angle constraints on the positions and orientations of viewpoint vectors to guarantee the training convergence. (b) The denoiser architecture of DiffusionGS in a single timestep.
  • Figure 5: Plücker ray vs. Reference-Point Plücker Coordinate.
  • ...and 4 more figures
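
The contrast named in the Figure 5 caption can be illustrated with a small sketch. The standard Plücker encoding of a ray with origin o and unit direction d is the 6-vector (o × d, d). For the reference-point variant, the code below assumes that the moment o × d is replaced by the point on the ray closest to the world origin, o − (o · d) d. That formula is an assumption of this sketch and is not stated in this summary; the function names `plucker_ray` and `reference_point_ray` are hypothetical and are not taken from the released code.

```python
import numpy as np


def _unit(d):
    """Return d scaled to unit length along the last axis."""
    d = np.asarray(d, dtype=np.float64)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)


def plucker_ray(o, d):
    """Standard Plücker encoding: (moment, direction) = (o x d, d)."""
    o = np.asarray(o, dtype=np.float64)
    d = _unit(d)
    moment = np.cross(o, d)
    return np.concatenate([moment, d], axis=-1)


def reference_point_ray(o, d):
    """Assumed reference-point encoding: (closest point to origin, d).

    The closest point is o - (o . d) d, which is perpendicular to d.
    """
    o = np.asarray(o, dtype=np.float64)
    d = _unit(d)
    t = np.sum(o * d, axis=-1, keepdims=True)  # signed distance along the ray
    closest = o - t * d
    return np.concatenate([closest, d], axis=-1)


# Example ray: starts at (1, 2, 3) and points along +z (direction is normalized).
origin = np.array([1.0, 2.0, 3.0])
direction = np.array([0.0, 0.0, 2.0])
print(plucker_ray(origin, direction))
print(reference_point_ray(origin, direction))
```

For this ray the Plücker vector is (2, −1, 0, 0, 0, 1) and the reference-point vector is (1, 2, 0, 0, 0, 1). Both encodings are unchanged when the origin slides along the ray, but the first three components of the reference-point form are an actual 3D position on the ray, while the moment vector only encodes that position indirectly.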