Utilizing Uncertainty in 2D Pose Detectors for Probabilistic 3D Human Mesh Recovery

Tom Wehrbein, Marco Rudolph, Bodo Rosenhahn, Bastian Wandt

TL;DR

The normalizing flow-based approach predicts plausible 3D human mesh hypotheses that are consistent with the image evidence while maintaining high diversity for ambiguous body parts, and it outperforms other state-of-the-art probabilistic methods.

Abstract

Monocular 3D human pose and shape estimation is an inherently ill-posed problem due to depth ambiguities, occlusions, and truncations. Recent probabilistic approaches learn a distribution over plausible 3D human meshes by maximizing the likelihood of the ground-truth pose given an image. We show that this objective function alone is not sufficient to best capture the full distributions. Instead, we propose to additionally supervise the learned distributions by minimizing the distance to distributions encoded in heatmaps of a 2D pose detector. Moreover, we reveal that current methods often generate incorrect hypotheses for invisible joints, a failure that the evaluation protocols do not detect. We demonstrate that person segmentation masks can be utilized during training to significantly decrease the number of invalid samples, and we introduce two metrics to evaluate this. Our normalizing flow-based approach predicts plausible 3D human mesh hypotheses that are consistent with the image evidence while maintaining high diversity for ambiguous body parts. Experiments on 3DPW and EMDB show that we outperform other state-of-the-art probabilistic methods. Code is available for research purposes at https://github.com/twehrbein/humr.
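
To make the heatmap supervision more concrete, the following is a minimal sketch of a sample-based distance between two sets of 2D points, using the Maximum Mean Discrepancy that the overview figure of the paper names. It is an illustration only and not the authors' implementation: the Gaussian kernel, the bandwidth value, the biased estimator, and the names `mmd_squared`, `heatmap_samples`, `close_projections`, and `far_projections` are all assumptions, and the point sets are synthetic rather than taken from a pose detector. It uses only the Python standard library so that it runs without a deep learning framework.

```python
import math
import random


def gaussian_kernel(p, q, bandwidth):
    """Gaussian (RBF) kernel between two 2D points. The kernel choice is an assumption."""
    sq_dist = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two sample sets."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


def sample_points(center, std, n, rng):
    """Draw n synthetic 2D points from an isotropic Gaussian (stand-in data)."""
    return [(rng.gauss(center[0], std), rng.gauss(center[1], std)) for _ in range(n)]


rng = random.Random(0)
# Stand-in for 2D joint positions sampled from a detector heatmap.
heatmap_samples = sample_points((0.0, 0.0), 1.0, 200, rng)
# Stand-ins for 2D projections of 3D hypotheses: one set matches, one does not.
close_projections = sample_points((0.0, 0.0), 1.0, 200, rng)
far_projections = sample_points((3.0, 3.0), 1.0, 200, rng)

mmd_close = mmd_squared(heatmap_samples, close_projections)
mmd_far = mmd_squared(heatmap_samples, far_projections)
print(f"matching sets:    {mmd_close:.4f}")
print(f"mismatching sets: {mmd_far:.4f}")
```

In this toy setting the estimate is close to zero when both point sets come from the same distribution and clearly larger when the projected hypotheses drift away from the heatmap samples, which is the property a training loss of this kind relies on. A real training setup would compute the same quantity on tensors in an automatic differentiation framework so that gradients can flow back to the model.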

Paper Structure

This paper contains 29 sections, 8 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Our method models the full posterior distribution of plausible 3D human meshes given an RGB image. By utilizing heatmaps of a 2D pose detector, the learned distributions have more meaningful diversity and are more accurate than distributions predicted by ProHMR (Kolotouros et al., 2021). Two mesh hypotheses and the projection of 100 right wrist samples are shown. $\star$ is the ground-truth 2D position of the right wrist.
  • Figure 2: Multi-hypothesis 3D human pose estimation methods (e.g., ProHMR (Kolotouros et al., 2021)) often generate implausible hypotheses, with joints visible that should be invisible. We significantly reduce the number of incorrect hypotheses by utilizing segmentation masks during training. $\star$ is the ground-truth 2D position of the left wrist.
  • Figure 3: Overview of our approach. Given an image $I$, we model the full posterior distribution of plausible 3D human meshes using a normalizing flow (NF). In addition to maximizing the likelihood of the ground-truth pose, we supervise the learned distributions by minimizing the Maximum Mean Discrepancy between heatmap samples and projections of 3D mesh hypotheses generated by our NF. Furthermore, a segmentation mask loss is used to penalize invalid hypotheses. Body shape $\bm{\beta}$ and camera parameters $\bm{\pi}_w$ are estimated deterministically.
  • Figure 4: Visualization of the person mask loss $\mathcal{L}_{\text{mask}}$ for the left wrist of the person in the center. We explicitly penalize hypotheses for invisible joints that lie outside the person masks by minimizing the $l_1$ distance to the closest corresponding heatmap samples. Plausible hypotheses not penalized by $\mathcal{L}_{\text{mask}}$ are shown as green dots, implausible ones as green triangles and heatmap samples as blue squares. Best viewed with zoom and in color.
  • Figure 5: Qualitative results for challenging in-the-wild images with significant occlusions or truncations of body parts. Four samples from the learned 3D human mesh distribution are shown together with the 2D projections of 100 hypotheses for a highly ambiguous joint.
  • ...and 7 more figures
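
The person mask loss described in the Figure 4 caption can be sketched roughly as follows for a single invisible joint. This is a hedged illustration under stated assumptions, not the authors' code: the function name `mask_loss`, the binary nested-list mask, the rounding of coordinates to pixels, the averaging over all hypotheses, and the treatment of points outside the image as lying outside the mask are all choices made for this sketch. The caption only states that hypotheses outside the person masks are penalized by their $l_1$ distance to the closest corresponding heatmap sample.

```python
def mask_loss(hypotheses, heatmap_samples, person_mask):
    """Penalize 2D hypotheses of an invisible joint that fall outside the person masks.

    hypotheses:      list of (x, y) projections of one joint, in pixel coordinates.
    heatmap_samples: list of (x, y) points sampled from the detector heatmap.
    person_mask:     nested list indexed as [row][col] = [y][x]; a truthy value
                     means the pixel belongs to some person mask.
    Returns the mean l1 penalty over all hypotheses (averaging is an assumption).
    """
    if not hypotheses:
        return 0.0
    height, width = len(person_mask), len(person_mask[0])
    total = 0.0
    for x, y in hypotheses:
        col, row = int(round(x)), int(round(y))
        inside_image = 0 <= row < height and 0 <= col < width
        # Assumption of this sketch: out-of-image points count as outside the mask.
        if inside_image and person_mask[row][col]:
            continue  # Hidden behind a person: plausible, so no penalty.
        # l1 distance to the closest heatmap sample.
        total += min(abs(x - sx) + abs(y - sy) for sx, sy in heatmap_samples)
    return total / len(hypotheses)


# Toy 4x4 image whose central 2x2 block is covered by a person mask.
toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
toy_heatmap_samples = [(2.0, 1.0), (1.0, 2.0)]
toy_hypotheses = [(1.0, 1.0), (3.0, 0.0)]  # first inside the mask, second outside

loss = mask_loss(toy_hypotheses, toy_heatmap_samples, toy_mask)
print(loss)
```

Here the first hypothesis lies inside the mask and contributes nothing, while the second lies outside and is pulled toward its nearest heatmap sample. The hard inside/outside test is not differentiable by itself; in a gradient-based setup only the distance term for the penalized hypotheses would carry gradients.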