Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

Yuanhao Cai; He Zhang; Kai Zhang; Yixun Liang; Mengwei Ren; Fujun Luan; Qing Liu; Soo Ye Kim; Jianming Zhang; Zhifei Zhang; Yuqian Zhou; Yulun Zhang; Xiaokang Yang; Zhe Lin; Alan Yuille

TL;DR

DiffusionGS tackles single-view 3D generation and reconstruction by baking 3D Gaussian splatting into the diffusion denoiser: the model outputs 3D Gaussians at every timestep, which keeps generation view-consistent without a depth estimator. It introduces a scene-object mixed training strategy and a new camera-conditioning method, RPPC, to learn a general 3D prior across objects and scenes. Compared with state-of-the-art methods, it improves PSNR/FID by 2.20 dB/23.25 on objects and 1.34 dB/19.16 on scenes, and it runs over 5$\times$ faster, at about 6 s per inference on an A100 GPU. The approach produces high-fidelity geometry and texture from prompt views in any direction, supports both object- and scene-level generation, and handles occlusion robustly.

Abstract

Existing feedforward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when the prompt view direction changes and mainly handle object-centric cases. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency, allowing the model to generate robustly from prompt views in any direction, beyond object-centric inputs. In addition, to improve the capability and generality of DiffusionGS, we scale up the 3D training data by developing a scene-object mixed training strategy. Experiments show that DiffusionGS improves PSNR/FID by 2.20 dB/23.25 on objects and 1.34 dB/19.16 on scenes over state-of-the-art methods, without a depth estimator. Our method is also over 5$\times$ faster ($\sim$6 s on an A100 GPU). Our project page at https://caiyuanhao1998.github.io/project/DiffusionGS/ shows video and interactive results. The code and models are publicly available at https://github.com/caiyuanhao1998/Open-DiffusionGS
