
3D Human Mesh Estimation from Virtual Markers

Xiaoxuan Ma, Jiajun Su, Chunyu Wang, Wentao Zhu, Yizhou Wang

TL;DR

This work tackles 3D human mesh estimation from monocular images by addressing the loss of body-shape information in skeleton-based intermediate representations. It introduces virtual markers, a set of $K=64$ body-surface landmarks learned from mocap data via archetypal analysis. The method estimates the 3D marker positions $P$ from volumetric heatmaps, updates the interpolation matrix $A$ using the marker confidences, and reconstructs the full mesh as $M = PA$. The model is trained with a combination of losses, $L_{vm}$, $L_{conf}$, and $L_{mesh}$ (the last comprising vertex, pose, normal, and edge terms), and benefits from mix-training across diverse datasets. Empirically, the method achieves state-of-the-art performance on H3.6M, 3DPW, and SURREAL, reducing shape ambiguities and handling occlusion more robustly than skeleton- or full-vertex-based approaches, with practical implications for mocap from wild images and realistic avatar generation.
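
The step from volumetric heatmaps to 3D marker positions can be illustrated with a soft-argmax (the expected voxel coordinate under a softmax over the volume). This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `soft_argmax_3d`, the `(K, D, H, W)` layout, and the use of the peak probability as a confidence score are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_argmax_3d(heatmaps):
    """Turn raw volumetric heatmaps into 3D positions (illustrative sketch).

    heatmaps: array of shape (K, D, H, W) holding unnormalized scores,
              one volume per marker (this layout is an assumption).
    Returns (positions, confidence):
      positions:  (3, K) expected voxel coordinates, ordered (x, y, z)
      confidence: (K,) peak probability per marker (an assumed proxy)
    """
    K, D, H, W = heatmaps.shape

    # Numerically stable softmax over each marker's whole volume.
    flat = heatmaps.reshape(K, -1)
    flat = flat - flat.max(axis=1, keepdims=True)
    prob = np.exp(flat)
    prob /= prob.sum(axis=1, keepdims=True)
    confidence = prob.max(axis=1)
    prob = prob.reshape(K, D, H, W)

    # Marginal distributions along each axis.
    pz = prob.sum(axis=(2, 3))  # (K, D)
    py = prob.sum(axis=(1, 3))  # (K, H)
    px = prob.sum(axis=(1, 2))  # (K, W)

    # Expected coordinate along each axis.
    x = px @ np.arange(W)
    y = py @ np.arange(H)
    z = pz @ np.arange(D)
    return np.stack([x, y, z], axis=0), confidence


# Toy usage: marker 0 has a sharp peak, marker 1 is completely flat.
toy = np.zeros((2, 4, 5, 6))
toy[0, 1, 2, 3] = 50.0  # peak at z=1, y=2, x=3
positions, confidence = soft_argmax_3d(toy)
```

Because the expectation is differentiable, a step like this can be trained end to end; a flat heatmap falls back to the volume centre with a low confidence, which is the kind of signal a confidence-based update of $A$ could use.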

Abstract

Inspired by the success of volumetric 3D pose estimation, some recent human mesh estimators propose to estimate 3D skeletons as intermediate representations, from which the dense 3D meshes are regressed by exploiting the mesh topology. However, body shape information is lost in extracting skeletons, leading to mediocre performance. Advanced motion capture systems solve the problem by placing dense physical markers on the body surface, which allows realistic meshes to be extracted from their non-rigid motions. However, they cannot be applied to wild images without markers. In this work, we present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface from large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation. Our approach outperforms the state-of-the-art methods on three datasets. In particular, it surpasses the existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes. Code is available at https://github.com/ShirleyMaxx/VirtualMarker

Paper Structure (24 sections, 7 equations, 14 figures, 10 tables)


Figures (14)

  • Figure 1: Mesh estimation results on four examples with different body shapes. Pose2Mesh (Choi et al., 2020), which uses 3D skeletons as the intermediate representation, fails to predict accurate shapes. Our virtual marker-based method obtains accurate estimates.
  • Figure 2: Left: The learned virtual markers (blue balls) in the back and front views. Grey balls denote markers that are not visible in the front view. The virtual markers act similarly to physical body markers and approximately outline the body shape. Right: Mesh estimation results of our approach. From left to right: the input image, the estimated 3D mesh overlaid on the image, and three different viewpoints of the estimated 3D mesh together with the intermediate predicted virtual markers (blue balls).
  • Figure 3: Overview of our framework. Given an input image $\mathbf{I}$, the framework first estimates the 3D positions $\hat{\mathbf{P}}$ of the virtual markers. It then updates the coefficient matrix $\hat{\mathbf{A}}$ based on the estimation confidence scores $\mathbf{C}$ of the virtual markers. Finally, the complete human mesh is recovered by a single matrix multiplication, $\hat{\mathbf{M}} = \hat{\mathbf{P}}\hat{\mathbf{A}}$.
  • Figure 4: Mesh estimation results of different methods on the H3.6M test set. Our method, which uses the virtual marker representation, produces better shape estimates than Pose2Mesh, which uses a skeleton representation. Note the waistline of the body and the thickness of the arm.
  • Figure 5: Visualization of the learned virtual markers for different numbers of markers, $K = 16, 32, 96$, from left to right.
  • ...and 9 more figures
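
The reconstruction described in the Figure 3 caption, $\hat{\mathbf{M}} = \hat{\mathbf{P}}\hat{\mathbf{A}}$, can be sketched in a few lines of NumPy. The shapes $(3, K)$ for the markers and $(K, V)$ for the coefficients follow from that product. Everything else here is an assumption for illustration: this summary does not state the exact confidence-based update rule, so `reweight_coefficients` shows one plausible scheme (scale each marker's row by its confidence, then renormalize each vertex's column), and the toy vertex count `V` and random data are placeholders rather than real mesh data.

```python
import numpy as np


def reweight_coefficients(A, conf, eps=1e-8):
    """Down-weight low-confidence markers (assumed scheme, not the paper's).

    A:    (K, V) coefficients; column j holds the non-negative weights,
          summing to 1, that interpolate vertex j from the K markers.
    conf: (K,) confidence per marker, in [0, 1].
    Returns A_hat of shape (K, V) whose columns again sum to 1.
    """
    weighted = A * conf[:, None]
    col_sum = weighted.sum(axis=0, keepdims=True)
    return weighted / np.maximum(col_sum, eps)


def reconstruct_mesh(P, A):
    """Recover vertices (3, V) from marker positions P (3, K) via M = P A."""
    return P @ A


# Toy usage with random data; V is deliberately tiny.
rng = np.random.default_rng(0)
K, V = 64, 10
P = rng.normal(size=(3, K))            # stand-in for estimated marker positions
A = rng.random((K, V))
A /= A.sum(axis=0, keepdims=True)      # each column is a convex combination

conf = np.ones(K)
conf[0] = 0.0                          # pretend marker 0 is occluded
A_hat = reweight_coefficients(A, conf)
mesh = reconstruct_mesh(P, A_hat)      # shape (3, V)
```

With columns that are convex combinations, as archetypal analysis enforces, every reconstructed vertex lies inside the convex hull of the markers that still carry weight, so a zero-confidence marker cannot pull the mesh toward a bad estimate.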