Multi-hypotheses Conditioned Point Cloud Diffusion for 3D Human Reconstruction from Occluded Images

Donghwan Kim, Tae-Kyun Kim

TL;DR

MHCDIFF (Multi-hypotheses Conditioned Point Cloud Diffusion) is a novel pipeline that conditions point cloud diffusion on probabilistic distributions for pixel-aligned, detailed 3D human reconstruction under occlusion. It outperforms various SOTA methods based on SMPL, implicit functions, point cloud diffusion, and their combinations, under both synthetic and real occlusions.

Abstract

3D human shape reconstruction under severe occlusion due to human-object or human-human interaction is a challenging problem. Parametric models, e.g., SMPL(-X), which are based on statistics across human shapes, can represent whole human body shapes but are limited to minimally clothed human shapes. Implicit-function-based methods extract features from the parametric models to employ prior knowledge of human bodies and can capture geometric details such as clothing and hair. However, given a single RGB image, they often struggle to handle misaligned parametric models and to inpaint occluded regions. In this work, we propose a novel pipeline, MHCDIFF, Multi-hypotheses Conditioned Point Cloud Diffusion, composed of point cloud diffusion conditioned on probabilistic distributions for pixel-aligned detailed 3D human reconstruction under occlusion. Compared to previous implicit-function-based methods, the point cloud diffusion model can capture globally consistent features to generate the occluded regions, and the denoising process corrects the misaligned SMPL meshes. The core of MHCDIFF is extracting local features from multiple hypothesized SMPL(-X) meshes and aggregating the set of features to condition the diffusion model. In experiments on the CAPE and MultiHuman datasets, the proposed method outperforms various SOTA methods based on SMPL, implicit functions, point cloud diffusion, and their combinations, under both synthetic and real occlusions. Our code is publicly available at https://donghwankim0101.github.io/projects/mhcdiff/ .
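To make the aggregation step concrete, the following is a minimal sketch of averaging occupancy values over several body hypotheses, the statistic that the Figure 2 caption describes as the conditioning signal. It is not the authors' implementation: each hypothesis is replaced by a toy sphere (an assumption, standing in for a hypothesized SMPL mesh) so that the signed distance has a closed form, and the names `signed_distance`, `occupancy`, and `mean_occupancy` are hypothetical.

```python
import math

def signed_distance(point, center, radius):
    """Signed distance from `point` to a sphere surface (negative inside).

    Assumption: a sphere stands in for one hypothesized body mesh.
    """
    return math.dist(point, center) - radius

def occupancy(point, center, radius):
    """1.0 if the point lies inside the hypothesis, else 0.0."""
    return 1.0 if signed_distance(point, center, radius) < 0.0 else 0.0

def mean_occupancy(point, hypotheses):
    """Average occupancy of `point` over all hypotheses.

    Values near 1 mean most hypotheses agree the point is inside the body;
    values near 0 mean most agree it is outside.
    """
    return sum(occupancy(point, c, r) for c, r in hypotheses) / len(hypotheses)

# Three toy hypotheses: two that roughly agree and one extreme sample.
hypotheses = [
    ((0.0, 0.0, 0.0), 1.0),
    ((0.5, 0.0, 0.0), 1.0),
    ((3.0, 0.0, 0.0), 1.0),
]

inside_most = mean_occupancy((0.25, 0.0, 0.0), hypotheses)   # inside 2 of 3
inside_outlier = mean_occupancy((3.0, 0.0, 0.0), hypotheses)  # inside 1 of 3
```

In this toy setting, a point supported only by the extreme hypothesis receives a low mean occupancy, which illustrates why averaging is less sensitive to outlying samples than following a single hypothesis.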

Paper Structure

This paper contains 28 sections, 5 equations, 7 figures, 4 tables, 2 algorithms.

Figures (7)

  • Figure 1: Image to 3D shape. From the segmented images, containing occlusion due to interaction, MHCDiff reconstructs 3D human shapes as point clouds.
  • Figure 2: (Left) Overview of MHCDiff. Given an occluded image $I$, MHCDiff reconstructs the 3D human shape as a point cloud. First, we extract the 2D feature map $\mathcal{E}(I)$ and hypothesize the pose and shape parameters of multiple plausible SMPL meshes $\{ S_i \}_{i \in \{1, \dots, s\}}$. Our method consists of the conditioned point cloud diffusion model (Sec. 4.4). We project the 2D image features to capture details of the image (Sec. 3) and extract local features from multiple hypothesized SMPL meshes to leverage human body priors (Sec. 4.3). (Upper Right) The details of local features (Sec. 4.2). The signed distance field is visualized in positive and negative regions. The arrows indicate normal vectors $\boldsymbol{n}$. (Lower Right) The details of multi-hypotheses (Sec. 4.3). We can consider the whole distribution during the denoising process with the argmax $\bar{i}$, in which case the denoising can be approximated by the red arrows. However, this is sensitive to extreme samples of the distribution, so we instead condition on the mean of occupancy values, which is visualized by transparency, and the denoising can be approximated by the blue arrows.
  • Figure 3: A cumulative occlusion-to-reconstruction test. This figure shows the performance of different models on images with various occlusion ratios. Starting from the whole-body images, which correspond to 0% occlusion, we randomly mask 10% to 40% of each image. MHCDiff is robust to the occlusion ratio, showing the best performance.
  • Figure 4: Qualitative results on the CAPE dataset. We compare our method with an SMPL estimation method and implicit-function-based methods. Given the upper image, PaMIR, ICON, and HiLo cannot generate the occluded regions. They also cannot handle the misaligned SMPL mesh on the arms, creating incomplete bodies. ProPose predicts the full-body shape, but cannot capture details such as the blazer in the lower image. In contrast, MHCDiff is robust to occlusion and misalignment, and can capture pixel-aligned details.
  • Figure 5: Qualitative results on in-the-wild images. The two images on the left show occlusions due to interactions, and the rightmost image shows loose clothes. For internet photos, we use Segment Anything (kirillov2023segany) to segment the images.
  • ...and 2 more figures
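
The occlusion protocol in the Figure 3 caption (randomly masking a given percentage of a whole-body image) can be sketched as follows. The caption does not specify the mask shape or how the ratio is measured, so this sketch assumes a square window grown around a random foreground pixel until it covers at least the target fraction of foreground pixels; the function name `occlude` and its parameters are hypothetical.

```python
import random

def occlude(mask, ratio, seed=0):
    """Zero out a square window covering at least `ratio` of foreground pixels.

    `mask` is a binary segmentation mask given as a list of rows (1 = person).
    Returns the occluded mask and the occlusion ratio actually achieved.
    Assumption: square window; the paper's exact masking scheme may differ.
    """
    out = [row[:] for row in mask]
    fg = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not fg:
        raise ValueError("mask has no foreground pixels")
    if ratio <= 0.0:
        return out, 0.0

    rng = random.Random(seed)
    cy, cx = rng.choice(fg)          # window center on the person
    target = ratio * len(fg)
    half = 0
    while True:
        covered = [(y, x) for y, x in fg
                   if abs(y - cy) <= half and abs(x - cx) <= half]
        if len(covered) >= target:
            break
        half += 1                    # grow the window by one pixel per side

    for y, x in covered:
        out[y][x] = 0
    return out, len(covered) / len(fg)

# A 10x10 all-foreground mask, occluded at a 30% target ratio.
full = [[1] * 10 for _ in range(10)]
occluded, achieved = occlude(full, 0.3, seed=1)
```

Because the window grows in whole-pixel steps, the achieved ratio can overshoot the target slightly, which is why it is returned alongside the mask.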