DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

Akash Sengupta, Thiemo Alldieck, Nikos Kolotouros, Enric Corona, Andrei Zanfir, Cristian Sminchisescu

TL;DR

DiffHuman tackles single-image photorealistic 3D human reconstruction by modeling a distribution $p(\mathcal{S}|\mathbf{I})$ over implicit surfaces $\mathcal{S}$ conditioned on an input image $\mathbf{I}$. It combines a conditional diffusion process over pixel-aligned observations with an intermediate neural implicit surface, so that multiple diverse yet input-consistent 3D avatars can be sampled. To reduce the computational cost, a hybrid diffusion framework uses a generator network to imitate rendering, delivering up to 55× speedups while preserving detail in unseen regions. Evaluations show competitive 3D metrics and improved texture and geometry in occluded regions, which makes the method practical for avatar creation. The work also points toward future extensions with weaker supervision and broader data sources.

Abstract

We present DiffHuman, a probabilistic method for photorealistic 3D human reconstruction from a single RGB image. Despite the ill-posed nature of this problem, most methods are deterministic and output a single solution, often resulting in a lack of geometric detail and blurriness in unseen or uncertain regions. In contrast, DiffHuman predicts a probability distribution over 3D reconstructions conditioned on an input 2D image, which allows us to sample multiple detailed 3D avatars that are consistent with the image. DiffHuman is implemented as a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation. During inference, we may sample 3D avatars by iteratively denoising 2D renders of the predicted 3D representation. Furthermore, we introduce a generator neural network that approximates rendering with considerably reduced runtime (55x speed up), resulting in a novel dual-branch diffusion framework. Our experiments show that DiffHuman can produce diverse and detailed reconstructions for the parts of the person that are unseen or uncertain in the input image, while remaining competitive with the state-of-the-art when reconstructing visible surfaces.
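
According to the paper's method overview, inference denoises with the fast generator branch and computes the 3D reconstruction explicitly only in the last step. The toy sketch below illustrates that control flow and nothing more. It is not the authors' implementation: the linear noise schedule, the deterministic DDIM-style update, the names `sample_observations`, `toy_generator` and `toy_render`, and the flat list standing in for the observation set are all assumptions made for illustration.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return [abar_0, ..., abar_T] for a linear beta schedule (assumed), abar_0 = 1."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def sample_observations(image, generator_x0, render_x0, num_steps=50, seed=0):
    """Denoise x_T -> x_1 with the cheap branch; call the render branch only at t = 1."""
    rng = random.Random(seed)
    abar = make_alpha_bars(num_steps)
    x_t = [rng.gauss(0.0, 1.0) for _ in image]  # x_T drawn from a standard normal
    for t in range(num_steps, 1, -1):
        x0_hat = generator_x0(x_t, image, t)  # cheap stand-in for rendering
        # Deterministic DDIM-style update (assumed sampler, eta = 0):
        # recover the implied noise, then re-noise the clean estimate to level t - 1.
        eps_hat = [(x - math.sqrt(abar[t]) * x0) / math.sqrt(1.0 - abar[t])
                   for x, x0 in zip(x_t, x0_hat)]
        x_t = [math.sqrt(abar[t - 1]) * x0 + math.sqrt(1.0 - abar[t - 1]) * e
               for x0, e in zip(x0_hat, eps_hat)]
    # Final step: the expensive branch is evaluated exactly once, on x_1.
    return render_x0(x_t, image, 1)


# Hypothetical stand-ins for the two branches; real ones would be neural networks.
calls = {"generator": 0, "render": 0}


def toy_generator(x_t, image, t):
    calls["generator"] += 1
    return [0.5 * (x + i) for x, i in zip(x_t, image)]


def toy_render(x_t, image, t):
    calls["render"] += 1
    return [0.5 * (x + i) for x, i in zip(x_t, image)]


x0 = sample_observations([0.1, 0.2, 0.3], toy_generator, toy_render, num_steps=10)
```

With `num_steps=10`, the cheap branch runs nine times and the render branch once, which is where the runtime saving would come from. Changing `seed` changes the initial noise and therefore the sample.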

Paper Structure (23 sections, 13 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: DiffHuman predicts a probability distribution over 3D human reconstructions conditioned on a single monocular RGB image. This enables us to sample multiple plausible, diverse and input-consistent reconstructions during inference. Samples from DiffHuman demonstrate a high level of geometric and colour-wise detail, particularly in unseen and uncertain regions of the human body surface.
  • Figure 2: Method overview. We use a diffusion probabilistic model [ho2020denoising] to predict a distribution over plausible 3D reconstructions conditioned on a single RGB image. During training, we predict noise-dependent pixel-aligned features $g^{(t)}_\Theta(\boldsymbol x_t, \mathbf I)$ given a noisy observation set $\boldsymbol x_t$ consisting of front/back albedo, depth and normal renders, and an RGB image $\mathbf{I}$. These features condition an SDF $f^{(t)}_\Theta$, which is dependent on both $\boldsymbol x_t$ and $\mathbf{I}$. $f^{(t)}_\Theta$ and $g^{(t)}_\Theta$ are neural networks that define an implicit surface $\mathcal{S}^{(t)}_\Theta(\boldsymbol x_t, \mathbf I)$. Then, we obtain an estimate of the denoised observation set $\boldsymbol x_{0_\Theta}^{(t)}$ by rendering $\mathcal{S}^{(t)}_\Theta$. We may additionally produce a shaded image $\mathbf{C}^{(t)}$ by applying a pixel-wise noise-dependent shading network $s^{(t)}_\Theta$. During inference, we can sample trajectories over observation sets $\boldsymbol x_{0:T} \sim p_\Theta(\boldsymbol x_{0:T} | \mathbf{I})$ by computing and rendering $\mathcal{S}^{(t)}_\Theta(\boldsymbol x_t, \mathbf I)$ in each denoising step. Our final 3D samples $\mathcal{S} \sim p_\Theta(\mathcal{S}| \mathbf{I})$ are obtained as the final reconstruction $\mathcal{S} = \mathcal{S}_\Theta^{(1)}(\boldsymbol x_1, \mathbf I)$. To mitigate the computational cost of rendering an implicit surface in every step, we train a "generator" network $h_\Theta^{(t)}$ that imitates rendering by directly mapping $g^{(t)}_\Theta(\boldsymbol x_t, \mathbf I)$ to $\boldsymbol x_{0_\Theta}^{(t)}$. During inference, we denoise using $h_\Theta^{(t)}$ and only explicitly compute the 3D reconstruction in the last step.
  • Figure 3: Qualitative comparison against deterministic monocular 3D human reconstruction methods [alldieck2022phorhum, corona2023s3f] that predict geometry, surface albedo and shaded colour. PHORHUM [alldieck2022phorhum] (retrained on our dataset) outputs good front predictions, but exhibits over-smoothed, flat geometry and blurry colours on the back. S3F [corona2023s3f] yields more detailed geometry, but colours are still often blurry. Moreover, both these methods occasionally paste the front colour predictions onto the back incorrectly (see row 3). Our method outputs multiple diverse samples, with a greater level of geometric detail and colour sharpness in uncertain regions, that are consistent with the input image after shading.
  • Figure 4: Qualitative comparison against deterministic monocular 3D human reconstruction methods that predict only surface geometry: PIFuHD [saito2020pifuhd], ICON [xiu2022icon] and ECON [xiu2023econ]. Samples from our method generally exhibit greater geometric detail in uncertain regions, while maintaining a high level of consistency with the input image in shaded renders. Moreover, deterministic methods often fall back towards the mean of the training data distribution when faced with ambiguous and challenging inputs [bishop94mixturedensity, cui2020learning, mathieu2015deep]; e.g. predicting trousers from the back instead of a long skirt in row 3. This can be mitigated by learning to predict a distribution over reconstructions instead.
  • Figure 5: Visualisation of the reverse process. The denoising trajectory shows noisy samples $\boldsymbol x_t$ and generated clean predictions $\bar{\boldsymbol x}^{(t)}_{0_\Theta}$ at each timestep. Clean predictions are initially very simple, akin to the outputs of many deterministic approaches, and become more detailed over time. The heatmaps show sample diversity, computed as the per-pixel variance of the observations in $\bar{\boldsymbol x}^{(t)}_{0_\Theta}$ over 10 samples. Diversity is low at the start of the denoising process ($t = 1000$), but increases gradually as the samples diverge. As one would expect, diversity is greater on the back than on the front.
  • ...and 5 more figures
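
The Figure 5 caption describes sample diversity as the per-pixel variance of the predicted observations over 10 samples. A minimal sketch of that computation follows. The function name `per_pixel_variance`, the nested-list image layout and the use of the population variance (dividing by N rather than N - 1) are assumptions, since the caption does not specify them.

```python
def per_pixel_variance(samples):
    """Compute a diversity heatmap from N single-channel images.

    samples: list of N images, each a list of H rows holding W floats.
    Returns an H x W nested list with the population variance at each pixel.
    """
    n = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    heatmap = []
    for r in range(height):
        row = []
        for c in range(width):
            values = [s[r][c] for s in samples]  # same pixel across all samples
            mean = sum(values) / n
            row.append(sum((v - mean) ** 2 for v in values) / n)
        heatmap.append(row)
    return heatmap


# Two 1 x 2 "images": the first pixel differs across samples, the second does not.
heatmap = per_pixel_variance([[[0.0, 1.0]], [[2.0, 1.0]]])
```

Pixels on which all samples agree get a variance of zero, while pixels on which they diverge (for example on the unseen back of the person) get larger values.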