Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy
Yuxuan Xue, Xianghui Xie, Riccardo Marin, Gerard Pons-Moll
TL;DR
Gen-3Diffusion creates realistic 3D objects and clothed avatars from a single RGB image. It synchronizes a pretrained 2D multi-view diffusion model with an explicit 3D Gaussian Splatting (3D-GS) diffusion model, so that 2D priors guide 3D reconstruction and the 3D model enforces consistency across the generated views. The 3D-GS diffusion model is trained to predict Gaussian splat parameters from noisy target views and is supervised through views rendered by a differentiable renderer; the 2D multi-view prior enters via one-step estimates of the clean target views, which improves 3D fidelity. Joint training and a 3D-consistent sampling scheme reinforce this two-way synergy. The authors report state-of-the-art results on object and clothed-avatar generation, strong generalization, and ablations that support the design choices; code and pretrained models will be released.
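The one-step target-view estimate mentioned above can be illustrated with the standard DDPM relation between a noisy sample, the predicted noise, and the implied clean sample. The sketch below is not the authors' code: the function name `one_step_x0_estimate`, the flat-list image layout, and the example values are assumptions chosen for illustration, and it assumes a noise-predicting (epsilon-parameterized) 2D model.

```python
import math


def one_step_x0_estimate(x_t, eps_pred, alpha_bar_t):
    """Return the clean-sample estimate implied by a noise prediction.

    Solves the standard DDPM forward relation
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    for x_0. Inputs are flat lists of floats (an assumed layout; a real
    model would operate on image tensors for each target view).
    """
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in (0, 1]")
    scale = math.sqrt(alpha_bar_t)        # signal coefficient
    sigma = math.sqrt(1.0 - alpha_bar_t)  # noise coefficient
    return [(x - sigma * e) / scale for x, e in zip(x_t, eps_pred)]


# Usage: noise a known sample, then recover it from the true noise.
x0 = [0.2, -0.5, 1.0]
eps = [0.3, -1.2, 0.7]
alpha_bar = 0.64
x_t = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * b
       for a, b in zip(x0, eps)]
recovered = one_step_x0_estimate(x_t, eps, alpha_bar)
```

With a perfect noise prediction the estimate matches the clean sample up to floating-point error; with a learned model it is only an approximation, which is why a second model that refines it can help.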
Abstract
Creating realistic 3D objects and clothed avatars from a single RGB image is an attractive yet challenging problem. Because the problem is ill-posed, recent works leverage powerful priors from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot guarantee that the generated multi-view images are 3D consistent. In this paper, we propose Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy. We combine a pretrained 2D diffusion model and a 3D diffusion model through a process that synchronizes the two diffusion models at both training and sampling time. The synergy between the 2D and 3D diffusion models brings two major advantages: 1) 2D helps 3D in generalization: the pretrained 2D model generalizes well to unseen images, providing strong shape priors for the 3D diffusion model; 2) 3D helps 2D in multi-view consistency: the 3D diffusion model enhances the 3D consistency of the 2D multi-view sampling process, resulting in more accurate multi-view generation. We validate our idea through extensive experiments on image-based object and clothed avatar generation tasks. Results show that our method generates realistic 3D objects and avatars with high-fidelity geometry and texture. Extensive ablations also validate our design choices and demonstrate strong generalization to diverse clothing and compositional shapes. Our code and pretrained models will be publicly released at https://yuxuan-xue.com/gen-3diffusion.
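One plausible reading of "3D helps 2D" at sampling time is sketched below: at each denoising step, the clean-view estimate from the 2D model is replaced by views rendered from the 3D model's output before the sampler moves to the next noise level. This is a hedged illustration, not the released implementation. The callables `predict_x0_2d` and `reconstruct_and_render` are hypothetical stand-ins for the 2D multi-view model and for the 3D-GS model plus renderer, and a deterministic DDIM update (eta = 0) is swapped in for whatever sampler the authors actually use.

```python
import math


def ddim_step(x_t, x0_hat, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from x_t toward x0_hat."""
    if not 0.0 < alpha_bar_t < 1.0:
        raise ValueError("alpha_bar_t must lie in (0, 1)")
    if not 0.0 < alpha_bar_prev <= 1.0:
        raise ValueError("alpha_bar_prev must lie in (0, 1]")
    a_t, s_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    a_p, s_p = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    # Noise implied by the current sample and the clean estimate.
    eps_hat = [(x - a_t * x0) / s_t for x, x0 in zip(x_t, x0_hat)]
    # Re-noise the clean estimate to the previous (less noisy) level.
    return [a_p * x0 + s_p * e for x0, e in zip(x0_hat, eps_hat)]


def consistent_sampling_step(x_t, predict_x0_2d, reconstruct_and_render,
                             alpha_bar_t, alpha_bar_prev):
    """One sampling step in which a 3D model refines the 2D estimate.

    predict_x0_2d(x_t) -> clean-view estimate from the 2D model.
    reconstruct_and_render(x_t, x0_2d) -> views rendered from a single
    3D representation, hence consistent with each other by construction.
    Both callables are hypothetical placeholders.
    """
    x0_2d = predict_x0_2d(x_t)
    x0_3d = reconstruct_and_render(x_t, x0_2d)
    # The sampler continues from the 3D-consistent estimate.
    return ddim_step(x_t, x0_3d, alpha_bar_t, alpha_bar_prev)
```

Because every target view at a given step is rendered from one shared 3D representation, the estimates fed back into the sampler agree across views, which an independent per-view 2D estimate cannot guarantee.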
