Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

Yuxuan Xue, Xianghui Xie, Riccardo Marin, Gerard Pons-Moll

TL;DR

Gen-3Diffusion creates realistic 3D objects and clothed avatars from a single RGB image. It synchronizes a pretrained 2D multi-view diffusion model with an explicit 3D Gaussian Splatting (3D-GS) diffusion model, so that multi-view generation becomes 3D consistent and 2D priors guide 3D reconstruction. The 3D-GS diffusion model is trained to predict Gaussian Splat parameters from noisy target views and is supervised through views rendered by a differentiable renderer; the 2D multi-view prior enters via one-step target-view estimates to improve 3D fidelity. Joint training and a 3D-consistent sampling scheme reinforce this two-way synergy. The authors report state-of-the-art results on object and clothed-avatar tasks, with ablations supporting the design choices and the generalization claims; code and pretrained models will be released.
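
The "one-step target-view estimate" can be made concrete under the standard DDPM parameterization. This is a sketch and not necessarily the paper's exact notation: $\mathbf{x}_t$ (the noisy target views at step $t$), $\bar{\alpha}_t$ (the cumulative noise schedule), and $c$ (the input image) are assumed symbols, and $\epsilon_\theta$ denotes the 2D multi-view diffusion model.

```latex
\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},
\qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\quad\Longrightarrow\quad
\tilde{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{x}_t, t, c)}{\sqrt{\bar{\alpha}_t}}
```

Solving the forward-noising equation for $\mathbf{x}_0$ and substituting the predicted noise gives a clean-view estimate in a single step, which is what a 3D model can then consume as a 2D prior.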

Abstract

Creating realistic 3D objects and clothed avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful priors from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot guarantee that the generated multi-view images are 3D consistent. In this paper, we propose Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy. We leverage a pre-trained 2D diffusion model and a 3D diffusion model via our elegantly designed process that synchronizes the two diffusion models at both training and sampling time. The synergy between the 2D and 3D diffusion models brings two major advantages: 1) 2D helps 3D in generalization: the pretrained 2D model has strong generalization ability to unseen images, providing strong shape priors for the 3D diffusion model; 2) 3D helps 2D in multi-view consistency: the 3D diffusion model enhances the 3D consistency of the 2D multi-view sampling process, resulting in more accurate multi-view generation. We validate our idea through extensive experiments on image-based object and clothed avatar generation tasks. Results show that our method generates realistic 3D objects and avatars with high-fidelity geometry and texture. Extensive ablations also validate our design choices and demonstrate strong generalization to diverse clothing and compositional shapes. Our code and pretrained models will be publicly released at https://yuxuan-xue.com/gen-3diffusion.

Paper Structure

This paper contains 25 sections, 10 equations, 15 figures, 10 tables, 2 algorithms.

Figures (15)

  • Figure 1: Given a single image of a person or an object, our method Gen-3Diffusion creates realistic 3D objects or clothed avatars with high-fidelity geometry and texture. We use Gaussian Splatting to flexibly represent various shapes which can be extracted to high-quality textured meshes.
  • Figure 2: Motivation for generative 3D reconstruction design. Unlike methods [hong2023lrm, saito2019pifu] that deterministically regress 3D from a single image, our Gen-3Diffusion learns a conditional distribution and samples a plausible 3D-GS, resulting in high-fidelity and realistic unseen regions.
  • Figure 2: Runtime performance comparison on objects. We evaluate runtime for generating 32 novel views.
  • Figure 3: Motivation for template-free avatar reconstruction design. Methods [xiu2023econ, ho2023sith] that rely on the SMPL [loper2015smpl] template suffer from inaccurate SMPL estimation and cannot represent challenging dresses or object interaction. Our Gen-3Diffusion is template-free, leverages the shape prior from 2D diffusion models, and can faithfully handle these challenges.
  • Figure 4: Method Overview. Given a single RGB image (A), we sample a realistic 3D object represented as 3D Gaussian Splatting (D) from our learned distribution. At each reverse step, our 3D generation model $g_\phi$ leverages the 2D multi-view diffusion prior from $\epsilon_\theta$, which provides a strong shape prior but is not 3D consistent (B, \ref{sec:2Dhelps3D}). We then refine the 2D reverse sampling trajectory with generated 3D renderings that are guaranteed to be 3D consistent (C, \ref{sec:3dhelps2d}). Our tight coupling ensures 3D consistency at each sampling step and obtains high-quality 3D Gaussian Splats.
  • ...and 10 more figures
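
The coupled sampling loop described in the Figure 4 caption can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the authors' implementation. `eps_theta`, `g_phi`, and `render` are hypothetical stand-ins for the 2D model, the 3D-GS model, and the differentiable renderer. The "3D state" is a single image shared by all views, the noise schedule is a linear toy schedule, and the reverse update is a deterministic DDIM-style step rather than whatever sampler the paper uses. The point is only the data flow: 2D one-step estimate (B), fused 3D state (D), consistent renderings that steer the next 2D step (C).

```python
import numpy as np

def one_step_x0(x_t, eps, ab_t):
    """Standard DDPM one-step clean-image estimate from a noise prediction."""
    return (x_t - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)

def eps_theta(x_t, ab_t, cond):
    """Toy stand-in for the 2D multi-view model (hypothetical).
    Each view 'believes' in a slightly different clean image, so the
    resulting one-step estimates are NOT consistent across views."""
    n_views = x_t.shape[0]
    offsets = np.linspace(-0.1, 0.1, n_views).reshape(n_views, 1, 1)
    believed_x0 = cond[None] + offsets
    return (x_t - np.sqrt(ab_t) * believed_x0) / np.sqrt(1.0 - ab_t)

def g_phi(x0_2d):
    """Toy stand-in for the 3D model (hypothetical): fuse all per-view
    estimates into ONE shared state by averaging."""
    return x0_2d.mean(axis=0)

def render(state, n_views):
    """Toy stand-in for the renderer: every view comes from the same
    state, so the renderings are consistent by construction."""
    return np.repeat(state[None], n_views, axis=0)

def sample(cond, n_views=4, T=10, seed=0):
    rng = np.random.default_rng(seed)
    # Toy schedule (assumption): alpha_bars[0] = 1 means "no noise".
    alpha_bars = np.linspace(1.0, 0.01, T + 1)
    x_t = rng.standard_normal((n_views,) + cond.shape)
    state = None
    for t in range(T, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        # (B) 2D prior: one-step estimate of the clean target views.
        x0_2d = one_step_x0(x_t, eps_theta(x_t, ab_t, cond), ab_t)
        # (D) 3D model turns the 2D estimates into one shared state.
        state = g_phi(x0_2d)
        # (C) Consistent renderings replace the 2D estimate.
        x0_3d = render(state, n_views)
        # Deterministic DDIM-style step toward the rendered views.
        eps_hat = (x_t - np.sqrt(ab_t) * x0_3d) / np.sqrt(1.0 - ab_t)
        x_t = np.sqrt(ab_prev) * x0_3d + np.sqrt(1.0 - ab_prev) * eps_hat
    return state, x_t

cond = np.full((8, 8), 0.5)   # stand-in for the input image
state, views = sample(cond)
```

In this toy setup the per-view disagreements injected by `eps_theta` average out in `g_phi`, and because each reverse step is driven by the rendered views, the final samples agree across all views. A real system would replace the three stand-ins with learned networks and a Gaussian Splatting rasterizer.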