W-HMR: Monocular Human Mesh Recovery in World Space with Weak-Supervised Calibration

Wei Yao; Hongwen Zhang; Yunlian Sun; Yebin Liu; Jinhui Tang

TL;DR

W-HMR tackles monocular 3D human motion recovery by decoupling camera calibration from world-space pose estimation. It introduces a weakly supervised focal-length predictor coupled with FULL$^2$ 2D supervision and an OrientCorrect module to produce plausible world-space poses without relying on precise focal-length labels or camera rotations. The approach uses a three-stage training paradigm and a hybrid regression architecture to deliver accurate reconstructions in both camera and world coordinates, outperforming state-of-the-art methods on distorted datasets and maintaining competitiveness on standard benchmarks. This work enables robust real-world human motion capture from monocular imagery with improved generalizability and practical applicability, and code is publicly available at the project page.

Abstract

Previous methods for 3D human motion recovery from monocular images often fall short due to reliance on camera coordinates, leading to inaccuracies in real-world applications. The limited availability and diversity of focal length labels further exacerbate misalignment issues in reconstructed 3D human bodies. To address these challenges, we introduce W-HMR, a weak-supervised calibration method that predicts "reasonable" focal lengths based on body distortion information, eliminating the need for precise focal length labels. Our approach enhances 2D supervision precision and recovery accuracy. Additionally, we present the OrientCorrect module, which corrects body orientation for plausible reconstructions in world space, avoiding the error accumulation associated with inaccurate camera rotation predictions. Our contributions include a novel weak-supervised camera calibration technique, an effective orientation correction module, and a decoupling strategy that significantly improves the generalizability and accuracy of human motion recovery in both camera and world coordinates. The robustness of W-HMR is validated through extensive experiments on various datasets, showcasing its superiority over existing methods. Codes and demos have been made available on the project page https://yw0208.github.io/w-hmr/.

Paper Structure (19 sections, 14 equations, 9 figures, 9 tables)

Introduction
Related Work
Camera Model in Human Recovery
Regression-based Method
Method
Preliminary
Weak-Supervised Camera Calibration
Orientation Correction
Training Paradigm
Other Losses
Experiment
Implementation Details
About Metrics
Evaluation Results
Ablation Study
...and 4 more sections

Figures (9)

Figure 1: Given input images, we show recovered motion output by traditional methods based on the camera coordinate and our W-HMR based on the world coordinate. In contrast to traditional methods, W-HMR is capable of rectifying incorrect poses in the camera coordinate and guaranteeing the rationality of poses in the world space.
Figure 2: Comparison of the camera models used by four other methods (HMR, Zolly, SPEC, and CLIFF) and by our W-HMR. $f_{crop}$ and $f_{full}$ denote the focal lengths of the cropped and full images, respectively. $R_c$ is the camera rotation matrix, and $I$ is the identity matrix. $s$ denotes scale. $h_{bbox}$ and $h$ are both bounding-box heights, but $h_{bbox}$ is dynamic, whereas $h$ is a constant. $t_{b}^{z}$ refers to the vertical distance from the human body to the plane where the camera is located. Items in red are trainable, i.e., predicted by neural networks.
Figure 3: Pipeline of W-HMR. A PyMAF-like backbone is adopted to extract features and predict the recovery results in the camera coordinate. “glob info” refers to global information consisting of five elements. $(\varDelta c_x,\,\varDelta c_y)$ are the offsets of the bounding box center relative to the center of the original image. FULL$^2$ 2D supervision refers to 2D joint supervision on full images, based on the full-perspective camera model. The CamClib takes the whole image as input and outputs camera rotation matrices. The OrientCorrect takes three vectors as input and outputs the body orientation in the world coordinate. $\{s,\,t_x,\,t_y\}$ are the scale and translations in the camera coordinate. $f$ is the focal length, and $(t_{b}^{x},\,t_{b}^{y},\,t_{b}^{z})$ is the translation to the camera in the world coordinate. Note that $\varPhi_3$ and $\phi_3^s$ are features extracted by the PyMAF-like backbone. For details of the feature extraction and network, see the Implementation Details section.
Figure 4: The image on the left is the original image. The middle panel shows, at the top, the cropped image used by traditional methods and, at the bottom, the focal length, body orientation, and shooting angle those methods predict. The image on the right shows the actual shooting scenario, illustrating that cropping discards necessary information, which ultimately degrades the reconstruction results.
Figure 5: Our model training paradigm, illustrating the detachment strategy used at different stages to achieve stable training. The symbol $\otimes$ represents detachment. Traditional 2D supervision refers to projecting 3D joints onto the cropped image with the weak-perspective camera model to obtain 2D joints, as in traditional models. “Freeze” means that the other modules are frozen in the third stage and only OrientCorrect is updated.
...and 4 more figures
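
The Figure 3 caption relates crop-level camera parameters $\{s,\,t_x,\,t_y\}$, the bounding-box offsets $(\varDelta c_x,\,\varDelta c_y)$, and the focal length $f$ to a full-image translation used for 2D supervision on full images. The sketch below illustrates one common way to make that link, using the CLIFF-style conversion followed by a pinhole projection onto the full image. It is an assumption that W-HMR follows this exact form; the function names, argument names, and example values are hypothetical and not taken from the paper's code.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def crop_to_full_translation(s: float, tx: float, ty: float,
                             bbox_size: float, dcx: float, dcy: float,
                             focal: float) -> Point3:
    """Convert crop-level (s, tx, ty) to a full-image translation.

    CLIFF-style relation (assumed here, not confirmed as W-HMR's formula):
      tz = 2 f / (s * bbox_size)
      tx_full = tx + 2 * dcx / (s * bbox_size)
      ty_full = ty + 2 * dcy / (s * bbox_size)
    dcx, dcy are the bbox-center offsets from the image center, in pixels.
    """
    if s <= 0 or bbox_size <= 0:
        raise ValueError("scale and bbox_size must be positive")
    denom = s * bbox_size
    return (tx + 2.0 * dcx / denom, ty + 2.0 * dcy / denom, 2.0 * focal / denom)


def project_to_full_image(joints: List[Point3], translation: Point3,
                          focal: float, img_w: float, img_h: float
                          ) -> List[Tuple[float, float]]:
    """Pinhole projection of 3D joints onto the full image.

    The principal point is assumed to be the image center.
    """
    tx, ty, tz = translation
    cx, cy = img_w / 2.0, img_h / 2.0
    projected = []
    for x, y, z in joints:
        depth = z + tz
        if depth <= 0:
            raise ValueError("joint lies behind the camera")
        projected.append((focal * (x + tx) / depth + cx,
                          focal * (y + ty) / depth + cy))
    return projected


# Hypothetical values: a 200 px box whose center is offset (100, -50) px
# from the center of a 1000 x 800 image, with a focal length of 1000 px.
trans = crop_to_full_translation(s=1.0, tx=0.0, ty=0.0, bbox_size=200.0,
                                 dcx=100.0, dcy=-50.0, focal=1000.0)
root_2d = project_to_full_image([(0.0, 0.0, 0.0)], trans,
                                focal=1000.0, img_w=1000.0, img_h=800.0)[0]
```

With zero crop-level translation, the root joint projects to the bounding-box center in the full image, which is a quick sanity check that the offsets are applied with the right sign. The focal length enters both the depth and the projection, which is why a "reasonable" focal length estimate matters for 2D supervision on full images.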
