MoCap-to-Visual Domain Adaptation for Efficient Human Mesh Estimation from 2D Keypoints

Bedirhan Uguz, Ozhan Suat, Batuhan Karagoz, Emre Akbas

TL;DR

Key2Mesh, a model that takes a set of 2D human pose keypoints as input and estimates the corresponding body mesh, sets a new state of the art by outperforming other models in PA-MPJPE on both the H3.6M and 3DPW datasets, and in MPJPE and PVE on the 3DPW dataset.

Abstract

This paper presents Key2Mesh, a model that takes a set of 2D human pose keypoints as input and estimates the corresponding body mesh. Since this process does not involve any visual (i.e., RGB image) data, the model can be trained on large-scale motion capture (MoCap) datasets, thereby overcoming the scarcity of image datasets with 3D labels. To enable the model's application on RGB images, we first run an off-the-shelf 2D pose estimator to obtain the 2D keypoints, and then feed these 2D keypoints to Key2Mesh. To improve the performance of our model on RGB images, we apply an adversarial domain adaptation (DA) method to bridge the gap between the MoCap and visual domains. Crucially, our DA method does not require 3D labels for visual data, which enables adaptation to target sets without the need for costly labels. We evaluate Key2Mesh for the task of estimating 3D human meshes from 2D keypoints, in the absence of RGB and mesh label pairs. Our results on the widely used H3.6M and 3DPW datasets show that Key2Mesh sets a new state of the art by outperforming other models in PA-MPJPE for both datasets, and in MPJPE and PVE for the 3DPW dataset. Thanks to our model's simple architecture, it operates at least 12x faster than the prior state-of-the-art model, LGD. Additional qualitative samples and code are available on the project website: https://key2mesh.github.io/.
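
For readers unfamiliar with the metrics named above, the following is a minimal sketch of MPJPE (mean per-joint position error) as it is commonly defined: the average Euclidean distance between predicted and ground-truth 3D joints. PA-MPJPE is the same quantity computed after a rigid Procrustes alignment of the prediction to the ground truth, and PVE is the analogous error over mesh vertices; neither is shown here. The function name `mpjpe` and the toy joint coordinates are illustrative assumptions, not taken from the paper's code.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: sequences of (x, y, z) joint positions in the same units.
    Returns the average Euclidean distance over corresponding joints.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy example with three hypothetical joints (coordinates in millimetres).
gt_joints = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred_joints = [(0.0, 0.0, 30.0), (100.0, 40.0, 0.0), (0.0, 200.0, 50.0)]
error_mm = mpjpe(pred_joints, gt_joints)  # per-joint errors: 30, 40, 50
```

Lower values are better for all three metrics.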

Paper Structure

This paper contains 29 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: An overview of Key2Mesh's inference and training process. (Top) Inference: For a given input image, we run an off-the-shelf 2D pose estimator to obtain 2D keypoints, which are input to our Key2Mesh model to estimate a body mesh. (Bottom) Training: The source domain consists of 3D body meshes obtained from MoCap data. We generate 2D keypoints from these body meshes using virtual cameras and a range of augmentations. We then pre-train our model using these generated 2D keypoints and corresponding meshes, without incorporating any RGB images. Finally, we adapt our model to target RGB images by using an off-the-shelf 2D pose estimator and bridging the gap between the "2D keypoints obtained from MoCap" domain and the "2D keypoints obtained from the target RGB images" domain.
  • Figure 2: Overview of our pre-training and domain adaptation phases. First, we begin with the "Pre-Training" phase, where we train a feature extractor and an SMPL head using only an unpaired MoCap dataset. Second, we introduce a "Domain Adaptation" phase to close the gap between the source domain (MoCap) and the target domain (Visual) by using 2D detections obtained by an off-the-shelf 2D pose estimator in the target domain.
  • Figure 3: Qualitative comparison of our pre-trained and domain-adapted models on the H3.6M dataset [h36m_pami]. The first column shows the input image and keypoint detections. The second and third columns display outputs from the pre-trained model, while the last two columns present results from the domain-adapted model, which outperforms our pre-trained model, especially on complex poses.
  • Figure 4: Qualitative results on the 3DPW dataset [von2018recovering:3dpw]. Each sample displays the input image and its corresponding 2D keypoint detections in the first column, the SMPL mesh overlaid on the image in the second column, and the side view of the SMPL mesh in the third column, all generated using the domain-adapted model.
  • Figure 5: Qualitative results on the H3.6M dataset [h36m_pami]. For each sample, the first column displays the input image and 2D keypoint detections, the second column shows the SMPL mesh overlaid on the image, and the third column presents the SMPL mesh from the side view. The domain-adapted model is used to generate these qualitative results.
  • ...and 3 more figures
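
The Figure 1 caption mentions generating 2D keypoints from MoCap body meshes using virtual cameras and augmentations. A minimal sketch of that idea, assuming a simple pinhole camera that orbits the body about the vertical axis, is given below. The function `project_joints`, the focal length, camera distance, principal point, and Gaussian jitter are all hypothetical choices made for illustration; the captions above do not specify the paper's camera sampling or augmentation settings.

```python
import math
import random


def project_joints(joints_3d, yaw_rad, cam_distance=5.0, focal=1000.0,
                   center=(500.0, 500.0), noise_std=0.0, rng=None):
    """Project body-centred 3D joints (metres) to 2D pixel keypoints.

    All default values are illustrative assumptions, not the paper's settings.
    """
    if rng is None:
        rng = random.Random(0)
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    keypoints = []
    for x, y, z in joints_3d:
        # Rotate about the vertical (y) axis to simulate a new viewpoint.
        xr = c * x + s * z
        zr = -s * x + c * z
        # Place the body cam_distance metres in front of the camera.
        depth = zr + cam_distance
        if depth <= 0:
            raise ValueError("joint lies behind the virtual camera")
        # Perspective division, then shift to the principal point.
        u = focal * xr / depth + center[0]
        v = focal * y / depth + center[1]
        # Optional Gaussian jitter as a stand-in for 2D detector noise.
        u += rng.gauss(0.0, noise_std)
        v += rng.gauss(0.0, noise_std)
        keypoints.append((u, v))
    return keypoints


# Three hypothetical joints viewed from the front (yaw = 0, no noise).
joints = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 1.0, 0.0)]
front_view = project_joints(joints, yaw_rad=0.0)
side_view = project_joints(joints, yaw_rad=math.pi / 2)
```

Sampling many values of `yaw_rad` and a non-zero `noise_std` would yield varied 2D keypoint sets paired with the same 3D source, which is the kind of paired data a keypoints-to-mesh model can be pre-trained on without RGB images.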