Leveraging 2D Masked Reconstruction for Domain Adaptation of 3D Pose Estimation

Hansoo Park, Chanwoo Kim, Jihyeon Kim, Hoseong Cho, Nhat Nguyen Bao Truong, Taehwan Kim, Seungryul Baek

TL;DR

The paper tackles domain shift in RGB-based 3D pose estimation by introducing an unsupervised domain adaptation framework that leverages unlabeled target data through Masked Image Modeling (MIM). A foreground-centric reconstruction term and an attention regularization mechanism are integrated into a two-stage pipeline (MAE-based pre-training and target-aware fine-tuning) to align source and target representations while preserving target-domain information. Empirical results across 3D hand and human pose benchmarks show state-of-the-art performance in cross-domain settings, with ablations confirming the effectiveness of foreground focus, segmentation-guided augmentation, and attention regularization. The approach demonstrates robust, data-efficient domain adaptation, enabling more reliable 3D pose estimation in diverse real-world scenarios without requiring target-domain labels.

Abstract

RGB-based 3D pose estimation methods have been successful with the development of deep learning and the emergence of high-quality 3D pose datasets. However, most existing methods do not operate well on test images whose distribution is far from that of the training data. This problem might be alleviated by involving diverse data during training, but it is non-trivial to collect such diverse data with corresponding labels (i.e., 3D pose). In this paper, we introduce an unsupervised domain adaptation framework for 3D pose estimation that utilizes unlabeled data in addition to labeled data via the masked image modeling (MIM) framework. Foreground-centric reconstruction and attention regularization are further proposed to increase the effectiveness of unlabeled data usage. Experiments are conducted on various datasets in human and hand pose estimation tasks, with a focus on the cross-domain scenario. We demonstrate the effectiveness of our method by achieving state-of-the-art accuracy on all datasets.

Paper Structure (18 sections, 7 equations, 7 figures, 9 tables)

This paper contains 18 sections, 7 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Example pose estimation results on domains that differ from the training dataset. The second and third columns show the results of the baseline and our framework, respectively. The baseline is a pose estimator trained on the source-domain dataset in a supervised manner. The baseline produces low-quality poses on different domains, while ours produces more accurate results.
  • Figure 2: Schematic diagram of the overall framework. Our framework consists of two stages. In the pre-training stage, we perform augmentation and masking on the input image. The encoder $f^\text{E}$ extracts the latent representation $\mathbf{F}_\text{MAE}$ from the image. Then, the decoder $f^\text{D}$ reconstructs the image $\hat{\mathbf{x}}$ from $\mathbf{F}^\text{D}_\text{MAE}$. The reconstructed image $\hat{\mathbf{x}}$ is used to compute the loss $L_\text{WMAE}$ together with the corresponding image $\mathbf{x}$ and segmentation mask $s$. In the fine-tuning stage, the encoder $f^\text{E}$ extracts the latent representation $\mathbf{F}_\text{img}$ from the source image $\mathbf{x}_\text{s}$. Then, the keypoint head $f^\text{H}$ estimates the heatmap $\hat{\mathbf{H}}$ from $\mathbf{F}_\text{img}$. Additionally, we obtain the attention maps $\bar{\mathbf{a}}^\text{E}$ and $\bar{\mathbf{a}}^\text{A}$ from $f^\text{E}$ and $f^{\text{A}}$, respectively. The estimated heatmap $\hat{\mathbf{H}}$ and the ground-truth heatmap $\mathbf{H}$ are used to compute $L_\text{kpt}$, and the attention maps $\bar{\mathbf{a}}^\text{E}$ and $\bar{\mathbf{a}}^\text{A}$ are used to compute $L_\text{attn}$.
  • Figure 3: Qualitative comparisons of scratch and ours on the (first row) STB mueller2017stb, (second row) RHD zimmerman_iccv2017, (third row) Panoptic (PAN) joo2018total, and (last row) Ganerated (GAN) mueller2018ganerated datasets. We show the 2D pose, 3D pose, and attention map for scratch and ours, respectively.
  • Figure 4: Qualitative comparisons of scratch and ours on the (first row) 3DPW von2018recovering, (second row) MPI-INF-3DHP (3DHP) mehta2017monocular, and (last row) SURREAL varol2017learning datasets. We show the 2D pose, 3D pose, and attention map for scratch and ours, respectively.
  • Figure 5: Qualitative comparisons of attention map for target domain data $X_T$.
  • ...and 2 more figures
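
The Figure 2 caption says only that $L_\text{WMAE}$ is computed from the reconstruction $\hat{\mathbf{x}}$, the image $\mathbf{x}$, and the segmentation mask $s$; the exact form of the loss is not given in this summary. The following is a minimal sketch of one way a foreground-centric masked reconstruction loss could look, not the authors' implementation. The function name `weighted_mae_loss`, the per-pixel weighting scheme, and the weights `w_fg` and `w_bg` are all assumptions made for illustration.

```python
import numpy as np

def weighted_mae_loss(x, x_hat, seg, patch_mask, w_fg=1.0, w_bg=0.1):
    """Hypothetical foreground-weighted masked reconstruction loss.

    x, x_hat   : (H, W, C) float arrays, original and reconstructed image
    seg        : (H, W) array in {0, 1}, 1 marks foreground pixels
    patch_mask : (H, W) array in {0, 1}, 1 marks pixels hidden from the encoder
    w_fg, w_bg : assumed weights for foreground and background pixels
    """
    # Per-pixel squared error, averaged over the channel axis.
    err = ((x - x_hat) ** 2).mean(axis=-1)
    # Foreground pixels get weight w_fg, background pixels w_bg;
    # pixels that were visible to the encoder get weight zero.
    weight = np.where(seg > 0, w_fg, w_bg) * patch_mask
    denom = weight.sum()
    # Guard against an input in which no pixel was masked.
    return float((weight * err).sum() / denom) if denom > 0 else 0.0
```

With `w_bg` smaller than `w_fg`, a reconstruction error on the hand or body region contributes more to the loss than the same error on the background, which is the intuition behind a foreground-centric term.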