Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses

Yongwei Nie; Changzhen Liu; Chengjiang Long; Qing Zhang; Guiqing Li; Hongmin Cai

TL;DR

The paper addresses camera–mesh entanglement in single-image Human Mesh Recovery (HMR), where an incorrect mesh and an incorrect camera can together still produce a low 2D reprojection loss. It extracts multiple RoIs of the same person, estimates one local camera per RoI, and requires all local cameras to agree on a single full-image camera through a camera consistency loss $L_{cam}$; a contrastive loss $L_{cont}$ further regularizes training. A RoI-aware feature fusion network outputs one mesh shared by all RoIs and one local camera per RoI. Because each local camera can be converted to the full-image camera, the converted cameras can be constrained across RoIs, and a projection of the RoI features into a latent space supports $L_{cont}$. The method reports state-of-the-art performance on benchmarks such as 3DPW and Human3.6M, and ablations examine the contribution of the RoI fusion, camera consistency, and contrastive components. The approach improves mesh accuracy and camera estimation in HMR and points to potential extensions to multi-view or video-based settings.

Abstract

Besides a 3D mesh, Human Mesh Recovery (HMR) methods usually need to estimate a camera for computing a 2D reprojection loss. Previous approaches may encounter the following problem: neither the mesh nor the camera is correct, yet their combination yields a low reprojection loss. To alleviate this problem, we define multiple RoIs (regions of interest) containing the same human and propose a multiple-RoI-based HMR method. Our key idea is that, with multiple RoIs as input, we can estimate multiple local cameras and apply additional constraints between them to improve the accuracy of the cameras and, in turn, the accuracy of the corresponding 3D mesh. To implement this idea, we propose a RoI-aware feature fusion network that estimates a 3D mesh shared by all RoIs as well as a local camera for each RoI. We observe that local cameras can be converted to the camera of the full image, and we use this conversion to construct a local camera consistency loss as the additional constraint imposed on local cameras. Another benefit of introducing multiple RoIs is that we can encapsulate our network in a contrastive learning framework and apply a contrastive loss to regularize its training. Experiments demonstrate the effectiveness of our multi-RoI HMR method and its superiority over recent prior art. Our code is available at https://github.com/CptDiaos/Multi-RoI.
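The local-to-full camera conversion and the resulting consistency constraint can be illustrated with a minimal sketch. It assumes a weak-perspective local camera $(s, t_x, t_y)$ per RoI and a CLIFF-style conversion to a full-image translation; the function names, the squared-error pairwise penalty, and the fixed focal length are illustrative assumptions, not the formulation given in the paper.

```python
import numpy as np

def local_to_full_camera(local_cam, bbox, img_size, focal):
    """Convert a local (crop) camera to a full-image camera translation.

    Assumed convention (CLIFF-style, not taken from the paper):
      local_cam = (s, tx, ty)   weak-perspective scale and translation
      bbox      = (cx, cy, b)   RoI centre and side length in pixels
      img_size  = (W, H)        full image size in pixels
    """
    s, tx, ty = local_cam
    cx, cy, b = bbox
    W, H = img_size
    tz = 2.0 * focal / (s * b)
    tX = tx + 2.0 * (cx - W / 2.0) / (s * b)
    tY = ty + 2.0 * (cy - H / 2.0) / (s * b)
    return np.array([tX, tY, tz])

def full_to_local_camera(full_t, bbox, img_size, focal):
    """Inverse of local_to_full_camera (hypothetical helper for testing)."""
    tX, tY, tz = full_t
    cx, cy, b = bbox
    W, H = img_size
    s = 2.0 * focal / (tz * b)
    tx = tX - 2.0 * (cx - W / 2.0) / (s * b)
    ty = tY - 2.0 * (cy - H / 2.0) / (s * b)
    return np.array([s, tx, ty])

def camera_consistency_loss(local_cams, bboxes, img_size, focal):
    """Mean squared disagreement between full cameras over all RoI pairs."""
    full = np.stack([
        local_to_full_camera(c, b, img_size, focal)
        for c, b in zip(local_cams, bboxes)
    ])
    total, pairs = 0.0, 0
    for i in range(len(full)):
        for j in range(i + 1, len(full)):
            total += float(np.sum((full[i] - full[j]) ** 2))
            pairs += 1
    return total / max(pairs, 1)
```

In this sketch the loss is zero only when every RoI's local camera maps to the same full-image camera, which is the property the abstract relies on to penalize a wrong camera that a single-RoI reprojection loss would tolerate.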

Paper Structure (15 sections, 19 equations, 9 figures, 4 tables)

Introduction
Related Work
Method
RoI-aware Feature Fusion Network
Camera Consistency Loss
Contrastive Loss
Total Training Loss
Extraction of RoIs
Experiments
Datasets and Metrics
Implementation Details
Comparison to Prior Arts
Ablation Study
Limitations
Conclusion and Future Work

Figures (9)

Figure 1: (a) The extracted RoI $i$ is fed to a regressor, which wrongly estimates a local camera that sees the mesh at $-10^\circ$, whereas the accurate local camera should see it at $0^\circ$. Consequently, when converted to the full camera, it wrongly sees the mesh at $20^\circ$ instead of the ground-truth $45^\circ$. (b) Likewise for RoI $j$, the full camera derived from the incorrectly estimated local camera ($30^\circ$) sees the mesh at $55^\circ$. In both (a) and (b), the false projection misleads the 2D reprojection loss into producing an incorrect 3D mesh. (c) We feed multiple RoIs into the network simultaneously and estimate the local cameras of the RoIs. Both local cameras can be converted to the full camera, from whose perspective the 3D mesh should be aligned. We use this observation to establish pairwise consistency losses between local cameras and obtain accurate local cameras ($0^\circ$ and $15^\circ$).
Figure 2: Overview of our method. Given an image, we extract multiple RoIs of a human, and use a RoI-aware feature fusion network to estimate the 3D mesh of the human together with cameras. We use a camera consistency loss and a contrastive loss to supervise the training of the network.
Figure 3: RoI-aware fusion. To obtain $\mathbf{u}_m$, we consider the position of the other bounding boxes relative to the $m$-th bounding box. We apply positional encoding to all the bounding boxes and then compute the relative position relation $\gamma_{m*}$ (where $*$ is an index in $[1,M]$). We then concatenate $\gamma_{m*}$ and the corresponding feature $\mathbf{h}_*$ to compute the weight $w_{m*}$. Finally, $\mathbf{u}_m$ is the weighted sum of $\{\mathbf{h}_m\}_{m=1}^M$ with $w_{m*}$ as the weights.
Figure 4: Conversion between local and full cameras in bird's eye view.
Figure 5: Contrastive loss. Taking RoIs $\{\mathbf{X}^i_m|m\in[1,M]\}$ of object $i$ and RoIs $\{\mathbf{X}^j_m|m\in[1,M]\}$ of object $j$ as an example, features $\{\mathbf{h}^i_m|m\in[1,M]\}$ and $\{\mathbf{h}^j_m|m\in[1,M]\}$ are first extracted from the respective RoIs by the shared backbone $E$. The features are then projected into a latent space, giving $\{\mathbf{z}^i_m|m\in[1,M]\}$ and $\{\mathbf{z}^j_m|m\in[1,M]\}$. Latent features from the same object attract each other, while latent features from different objects repel each other.
...and 4 more figures
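
The attract/repel behaviour described in the Figure 5 caption can be sketched as a supervised-contrastive (InfoNCE-style) loss over latent features. The exact form of $L_{cont}$ is not given in this summary, so the cosine similarity, the temperature value, and the function name below are assumptions made for illustration only.

```python
import numpy as np

def multi_roi_contrastive_loss(z, temperature=0.1):
    """Contrastive loss over latent RoI features (illustrative sketch).

    z: array of shape (N, M, D) with N objects, M >= 2 RoIs per object,
       and D latent dimensions. RoIs of the same object are positives;
       RoIs of different objects are negatives.
    """
    N, M, D = z.shape
    flat = z.reshape(N * M, D)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)  # unit length
    sim = flat @ flat.T / temperature                          # cosine / tau

    labels = np.repeat(np.arange(N), M)          # object index of each RoI
    self_mask = np.eye(N * M, dtype=bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    # Exclude each anchor from its own softmax denominator.
    sim = np.where(self_mask, -np.inf, sim)
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(
        np.exp(sim - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denominator

    # Average negative log-probability over the positives of each anchor.
    per_anchor = -np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = per_anchor / pos_mask.sum(axis=1)
    return float(per_anchor.mean())
```

Under these assumptions the loss decreases as the latent features of one object's RoIs cluster together and move away from those of other objects.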
