Personalized 3D Human Pose and Shape Refinement

Tom Wehrbein, Bodo Rosenhahn, Iain Matthews, Carsten Stoll

TL;DR

This work targets the persistent misalignment in regression-based 3D human pose and shape estimation by learning dense per-pixel 2D displacement fields that relate renderings of an initial SMPL mesh to the observed image. A CNN predicts the displacement fields from multi-modal renderings (RGB, depth, normals, and unique vertex colors) together with image features. The per-pixel displacements are then converted to per-vertex displacements via barycentric interpolation and drive refinement through a dense 2D reprojection loss within SMPLify. The approach is compatible with multiple base regressors and priors, and on 3DPW and RICH it yields improved image-model alignment and 3D accuracy, often outperforming OpenPose- and DensePose-based refinements. Ablation studies show the benefits of including texture, depth, and normal information and indicate robustness to texture noise and illumination changes. Limitations remain when clothing changes and where SMPL lacks the expressiveness to model loose garments.
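The conversion from per-pixel to per-vertex displacements can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a rasterizer that reports, for every pixel, the index of the covering face and its barycentric weights, and it assumes that each vertex receives the barycentric-weighted average of the displacements at the pixels covering its faces. The function name and all argument names are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
def pixel_to_vertex_displacements(pix_to_face, bary, flow, faces, num_vertices):
    """Scatter per-pixel 2D displacements onto mesh vertices (illustrative sketch).

    pix_to_face: H x W nested list, face index per pixel, -1 for background
    bary:        H x W nested list of (w0, w1, w2) barycentric weights
    flow:        H x W nested list of (dx, dy) predicted displacements
    faces:       list of (v0, v1, v2) vertex indices per face
    Returns (displacements, visible), one entry per vertex.
    """
    acc = [[0.0, 0.0] for _ in range(num_vertices)]
    weight = [0.0] * num_vertices

    for y, row in enumerate(pix_to_face):
        for x, face_idx in enumerate(row):
            if face_idx < 0:
                continue  # background pixel, no mesh surface here
            dx, dy = flow[y][x]
            # Each corner of the covering face receives a share of the
            # pixel displacement proportional to its barycentric weight.
            for vertex, w in zip(faces[face_idx], bary[y][x]):
                acc[vertex][0] += w * dx
                acc[vertex][1] += w * dy
                weight[vertex] += w

    # Assumption: a vertex counts as visible if any rendered pixel
    # contributed a non-zero weight to it.
    visible = [w > 0.0 for w in weight]
    displacements = [
        (a[0] / w, a[1] / w) if w > 0.0 else (0.0, 0.0)
        for a, w in zip(acc, weight)
    ]
    return displacements, visible
```

Vertices that receive no weight are flagged as not visible, so a later refinement step can ignore them.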

Abstract

Recently, regression-based methods have dominated the field of 3D human pose and shape estimation. Despite their promising results, a common issue is the misalignment between predictions and image observations, often caused by minor joint rotation errors that accumulate along the kinematic chain. To address this issue, we propose to construct dense correspondences between initial human model estimates and the corresponding images that can be used to refine the initial predictions. To this end, we utilize renderings of the 3D models to predict per-pixel 2D displacements between the synthetic renderings and the RGB images. This allows us to effectively integrate and exploit appearance information of the persons. Our per-pixel displacements can be efficiently transformed to per-visible-vertex displacements and then used for 3D model refinement by minimizing a reprojection loss. To demonstrate the effectiveness of our approach, we refine the initial 3D human mesh predictions of multiple models using different refinement procedures on 3DPW and RICH. We show that our approach not only consistently leads to better image-model alignment, but also to improved 3D accuracy.
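The reprojection loss mentioned in the abstract can be sketched as follows. This is an assumption-laden illustration rather than the paper's formulation: it uses a simple pinhole camera, a plain mean squared error, and takes each visible vertex's 2D target to be its initial projection shifted by its predicted displacement. The names `project` and `dense_reprojection_loss` and their parameters are hypothetical.

```python
def project(point, focal, center):
    """Pinhole projection of a 3D point (x, y, z) with z > 0 to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])


def dense_reprojection_loss(vertices, init_2d, disp, visible, focal, center):
    """Mean squared 2D distance between projected vertices and their targets.

    vertices: current 3D vertex positions, list of (x, y, z)
    init_2d:  2D projections of the initial mesh, list of (u, v)
    disp:     per-vertex 2D displacements, list of (du, dv)
    visible:  per-vertex visibility flags
    """
    total, count = 0.0, 0
    for vertex, p0, d, vis in zip(vertices, init_2d, disp, visible):
        if not vis:
            continue  # occluded vertices carry no image evidence
        target_u, target_v = p0[0] + d[0], p0[1] + d[1]
        u, v = project(vertex, focal, center)
        total += (u - target_u) ** 2 + (v - target_v) ** 2
        count += 1
    return total / count if count else 0.0


# Usage: the loss is positive for the initial mesh and drops to zero once
# the mesh projects onto the displaced targets.
initial = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
init_2d = [project(p, 2.0, (0.0, 0.0)) for p in initial]
disp = [(1.0, 0.0), (0.0, 0.0)]
visible = [True, True]
loss_before = dense_reprojection_loss(initial, init_2d, disp, visible, 2.0, (0.0, 0.0))
refined = [(1.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
loss_after = dense_reprojection_loss(refined, init_2d, disp, visible, 2.0, (0.0, 0.0))
```

In an actual refinement, a loss of this kind would be minimized over the body model's pose and shape parameters, typically alongside prior terms, instead of over free vertex positions as in this toy usage.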

Paper Structure

This paper contains 20 sections, 7 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Given an initial 3D human model estimate, we predict per-pixel 2D displacements between renderings of the 3D model and the given image that we subsequently use to refine the initial prediction. For clarity, only a sparse subset of displacement vectors is shown. Image is taken from RICH [huang2022rich].
  • Figure 2: Overview of our proposed approach. Given an image with an estimated 3D human mesh, camera parameters $\bm{\pi}$, and an approximate texture map of the target person, we predict the per-pixel 2D displacement field between the 3D human model renderings and the image. The per-pixel 2D displacements are transformed to per-visible-vertex displacements and can subsequently be used to refine the 3D human model using, e.g., SMPLify [Bogo2016ECCV]. For clarity, only a sparse subset of displacement vectors is shown. Reference image from 3DPW [marcard2018eccv].
  • Figure 3: Examples of reconstructed texture maps from Human3.6M, 3DPW and RICH used for training and evaluation. We generate texture maps by back-projecting the image colors from multiple frames to all visible vertices.
  • Figure 4: Qualitative results on RICH [huang2022rich] and 3DPW [marcard2018eccv]. From left to right: input images, initial body estimates, our predicted displacement fields, our refined 3D human models, and side views of initial, refined, and ground-truth bodies.
  • Figure 5: Qualitative comparison on RICH [huang2022rich]. We compare our refined 3D human models (red) with refinements using OpenPose keypoints (green) and the ground-truth bodies (magenta). Best viewed with zoom and in color.
  • ...and 7 more figures