Personalized 3D Human Pose and Shape Refinement
Tom Wehrbein, Bodo Rosenhahn, Iain Matthews, Carsten Stoll
TL;DR
This work targets the persistent misalignment in regression-based 3D human pose and shape estimation by learning dense per-pixel 2D displacement fields that relate renderings of an initial SMPL mesh to the observed image. A CNN predicts the displacement fields from multi-modal renderings (RGB, depth, normals, and unique vertex colors) together with image features. The fields are then converted to per-vertex displacements via barycentric interpolation and drive refinement through a dense 2D reprojection loss within SMPLify. The approach is compatible with multiple base regressors and priors. On 3DPW and RICH it improves both image-model alignment and 3D accuracy, often outperforming OpenPose- and DensePose-based refinements. Ablation studies show the benefit of including texture, depth, and normal information and indicate robustness to texture noise and illumination changes. Remaining limitations are changes in clothing and SMPL's limited expressiveness for loose garments.
Abstract
Recently, regression-based methods have dominated the field of 3D human pose and shape estimation. Despite their promising results, a common issue is the misalignment between predictions and image observations, often caused by minor joint rotation errors that accumulate along the kinematic chain. To address this issue, we propose to construct dense correspondences between initial human model estimates and the corresponding images that can be used to refine the initial predictions. To this end, we utilize renderings of the 3D models to predict per-pixel 2D displacements between the synthetic renderings and the RGB images. This allows us to effectively integrate and exploit appearance information of the persons. Our per-pixel displacements can be efficiently transformed to per-visible-vertex displacements and then used for 3D model refinement by minimizing a reprojection loss. To demonstrate the effectiveness of our approach, we refine the initial 3D human mesh predictions of multiple models using different refinement procedures on 3DPW and RICH. We show that our approach not only consistently leads to better image-model alignment, but also to improved 3D accuracy.
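The step from per-pixel displacements to per-visible-vertex displacements, and the reprojection loss built on them, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it assumes the renderer also outputs a per-pixel face index map (`face_idx`, with -1 for background) and per-pixel barycentric coordinates (`bary`), and it averages pixel displacements onto the three vertices of each hit face using the barycentric weights. All function and parameter names are hypothetical.

```python
import numpy as np


def pixel_to_vertex_displacements(disp, face_idx, bary, faces, num_vertices):
    """Average per-pixel 2D displacements onto mesh vertices.

    disp:      (H, W, 2) predicted per-pixel 2D displacements
    face_idx:  (H, W) index of the face rendered at each pixel, -1 = background
    bary:      (H, W, 3) barycentric coordinates of each pixel inside its face
    faces:     (F, 3) vertex indices of each triangle
    Returns (V, 2) per-vertex displacements and a (V,) visibility mask.
    """
    fg = face_idx >= 0                  # pixels covered by the rendered mesh
    vert_ids = faces[face_idx[fg]]      # (P, 3) vertices of each hit face
    weights = bary[fg]                  # (P, 3) barycentric weights
    d = disp[fg]                        # (P, 2) displacement at each pixel

    acc = np.zeros((num_vertices, 2))
    wsum = np.zeros(num_vertices)
    # Scatter-add weighted displacements; np.add.at handles repeated indices.
    weighted = (weights[..., None] * d[:, None, :]).reshape(-1, 2)
    np.add.at(acc, vert_ids.ravel(), weighted)
    np.add.at(wsum, vert_ids.ravel(), weights.ravel())

    visible = wsum > 1e-8               # vertices that received any weight
    vert_disp = np.zeros((num_vertices, 2))
    vert_disp[visible] = acc[visible] / wsum[visible, None]
    return vert_disp, visible


def dense_reprojection_loss(proj_refined, proj_initial, vert_disp, visible):
    """Mean squared 2D distance between refined projections and targets.

    The target of each visible vertex is its initial 2D projection shifted
    by the displacement assigned to it.
    """
    target = proj_initial + vert_disp
    diff = proj_refined[visible] - target[visible]
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```

In an optimization loop such as SMPLify, `proj_refined` would be recomputed from the current pose and shape parameters at every iteration, while the targets stay fixed. Only visible vertices contribute, since occluded vertices receive no displacement from the rendering.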
