CameraHMR: Aligning People with Perspective

Priyanka Patel, Michael J. Black

TL;DR

HumanFoV, a field-of-view prediction model trained on a dataset of images containing people, estimates camera intrinsics, and the HMR2.0 architecture is upgraded to include these estimated camera parameters. Together, these lead to more accurate pseudo ground truth (pGT) and a new model, CameraHMR, with state-of-the-art accuracy.

Abstract

We address the challenge of accurate 3D human pose and shape estimation from monocular images. The key to accuracy and robustness lies in high-quality training data. Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations, assuming a simplified camera with default intrinsics. We make two contributions that improve pGT accuracy. First, to estimate camera intrinsics, we develop a field-of-view prediction model (HumanFoV) trained on a dataset of images containing people. We use the estimated intrinsics to enhance the 4D-Humans dataset by incorporating a full perspective camera model during SMPLify fitting. Second, 2D joints provide limited constraints on 3D body shape, resulting in average-looking bodies. To address this, we use the BEDLAM dataset to train a dense surface keypoint detector. We apply this detector to the 4D-Humans dataset and modify SMPLify to fit the detected keypoints, resulting in significantly more realistic body shapes. Finally, we upgrade the HMR2.0 architecture to include the estimated camera parameters. We iterate model training and SMPLify fitting initialized with the previously trained model. This leads to more accurate pGT and a new model, CameraHMR, with state-of-the-art accuracy. Code and pGT are available for research purposes.
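
To make the role of the estimated field of view concrete, the following is a minimal sketch, not the authors' implementation, of how a predicted FoV could be converted into pinhole intrinsics and used for full perspective projection. The function names are hypothetical, and the choice of a vertical FoV, square pixels, and a principal point at the image center are assumptions made purely for illustration.

```python
import math


def focal_from_fov(fov_deg, size_px):
    """Pinhole focal length (pixels) for a field of view spanning size_px pixels."""
    return 0.5 * size_px / math.tan(math.radians(fov_deg) / 2.0)


def intrinsics_from_fov(fov_deg, width, height):
    """3x3 intrinsics matrix as nested lists.

    Assumptions (not taken from the paper): fov_deg is the vertical FoV,
    pixels are square, and the principal point is the image center.
    """
    f = focal_from_fov(fov_deg, height)
    return [
        [f, 0.0, width / 2.0],
        [0.0, f, height / 2.0],
        [0.0, 0.0, 1.0],
    ]


def project_perspective(points_cam, K):
    """Project camera-frame 3D points (x, y, z) with z > 0 to pixel coordinates."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points_cam]


if __name__ == "__main__":
    K = intrinsics_from_fov(90.0, 2000, 1000)
    print(project_perspective([(1.0, 0.0, 2.0)], K))
```

With a 90-degree vertical FoV on a 2000×1000 image, the focal length is 500 px, so a point 1 m to the right of the optical axis at 2 m depth lands 250 px to the right of the image center. A different FoV changes this offset, which is why the intrinsics matter when fitting a 3D body to 2D evidence.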

Paper Structure

This paper contains 21 sections, 10 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Putting people in perspective. In contrast to common methods like HMR2.0, CameraHMR estimates 3D human shape and pose using a perspective camera by leveraging a learned regressor, HumanFoV, to estimate the appropriate camera intrinsics. Note how this improves the estimated pose when there is strong foreshortening. Our approach exploits new pseudo ground-truth data and a new dense surface keypoint detector that improve body shape estimation; this is particularly visible for the heavier people in the images. CameraHMR defines the new state-of-the-art for 3D human pose and shape accuracy from a single image.
  • Figure 2: Pseudo-ground-truth (pGT) training data. Row 1: example images from the 4D-Humans dataset. Rows 2 and 3: original pGT overlaid and viewed from a different perspective. Rows 4 and 5: our improved pGT using CamSMPLify. Note that our approach reduces the bias towards bent knees (columns 1, 5, 6), improves 3D pose and image alignment when there is foreshortening (columns 2, 4, 7, 9), and estimates more realistic body shape (columns 1, 3, 7, 8).
  • Figure 3: Overview of CamSMPLify. The DenseKP module processes cropped images to produce dense surface keypoints, while the HumanFoV module uses full images to estimate camera intrinsics. CamSMPLify uses the outputs of both modules to optimize the SMPL model parameters, $\beta$ and $\theta$, and the global translation $t^{\text{full}}$. Our iterative training strategy starts with initial estimates, $\mathrm{V}_{init}$, from CameraHMR, which are used to regularize the CamSMPLify estimates. CameraHMR is then iteratively refined based on the improved pGT from CamSMPLify.
  • Figure 4: Qualitative results of different baselines on LSP and MPII test images. CameraHMR achieves better 3D pose and shape reconstruction while also achieving more accurate 2D alignment compared to other SOTA methods trained on comparable datasets.
  • Figure 5: Focal length distribution of images used in training HumanFoV.
  • ...and 3 more figures