Exploiting Aggregation and Segregation of Representations for Domain Adaptive Human Pose Estimation

Qucheng Peng, Ce Zheng, Zhengming Ding, Pu Wang, Chen Chen

TL;DR

This work tackles domain shift in 2D human pose estimation by disentangling representations into domain-invariant and domain-specific components and applying both aggregation and segregation during learning. A novel Intermediate Domain Framework (IDF) introduces explicit domain-specific heads to create intermediate representations and employs a multi-relational discrepancy loss based on Maximum Mean Discrepancy (MMD) across inter- and intra-hypothesis keypoints, formalized as $\mathcal{L}_{dl}=\mathcal{L}_{inter}-\mathcal{L}_{spec}$ and $\mathcal{L}_{inter}=\mathcal{MMD}_{output}(\boldsymbol{H^*},\boldsymbol{H^*_{a}})$, with $L_{r1}$, $L_{r2}$, and $L_{r3}$ defined over keypoint pairs. Training proceeds in three stages (warm-up on source, adversarial discrepancy maximization on target, and discrepancy minimization), combining $\mathcal{L}_{mse}$, $\mathcal{L}_{oks}$, and $\mathcal{L}_{dl}$ to achieve robust transfer. Extensive experiments across hand and human pose adaptation benchmarks show state-of-the-art performance and strong generalization with efficient training. The code is available at the project site.
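To make the MMD-based discrepancy more concrete, the following is a minimal sketch of a squared-MMD estimate between two small sets of 2D keypoint predictions. It is not the authors' implementation: the Gaussian RBF kernel, the `bandwidth` value, the biased estimator, and the names `main_head`, `aux_head_close`, and `aux_head_far` are all assumptions chosen for illustration, standing in loosely for the two outputs compared in $\mathcal{L}_{inter}$.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical normalized (x, y) keypoint predictions from two regressor heads.
main_head = [(0.10, 0.20), (0.40, 0.50), (0.80, 0.30)]
aux_head_close = [(0.10, 0.20), (0.40, 0.50), (0.80, 0.30)]  # identical predictions
aux_head_far = [(0.90, 0.90), (0.10, 0.80), (0.20, 0.10)]    # disagreeing predictions

same = mmd_squared(main_head, aux_head_close)
diff = mmd_squared(main_head, aux_head_far)
print(f"agreeing heads:    {same:.4f}")
print(f"disagreeing heads: {diff:.4f}")
```

The estimate is zero when the two heads agree and grows as their predictions diverge, which is the property that lets such a quantity be maximized in one training stage and minimized in the next.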

Abstract

Human pose estimation (HPE) has received increasing attention recently due to its wide application in motion analysis, virtual reality, healthcare, etc. However, it suffers from the lack of labeled diverse real-world datasets due to the time- and labor-intensive annotation. To cope with the label deficiency issue, one common solution is to train the HPE models with easily available synthetic datasets (source) and apply them to real-world data (target) through domain adaptation (DA). Unfortunately, prevailing domain adaptation techniques within the HPE domain remain predominantly fixated on effecting alignment and aggregation between source and target features, often sidestepping the crucial task of excluding domain-specific representations. To rectify this, we introduce a novel framework that capitalizes on both representation aggregation and segregation for domain adaptive human pose estimation. Within this framework, we address the network architecture aspect by disentangling representations into distinct domain-invariant and domain-specific components, facilitating aggregation of domain-invariant features while simultaneously segregating domain-specific ones. Moreover, we tackle the discrepancy measurement facet by delving into various keypoint relationships and applying separate aggregation or segregation mechanisms to enhance alignment. Extensive experiments on various benchmarks, e.g., Human3.6M, LSP, H3D, and FreiHand, show that our method consistently achieves state-of-the-art performance. The project is available at https://github.com/davidpengucf/EPIC.

Paper Structure

This paper contains 29 sections, 11 equations, 14 figures, and 16 tables.

Figures (14)

  • Figure 1: Comparison of (a) previous works and (b) our objective. Previous works align source and target directly, which may result in suboptimal performances due to the mixing of diverse features. Our objective aggregates domain-invariant features and segregates domain-specific features simultaneously to craft a good regressor in the target domain.
  • Figure 2: Hypothesis discrepancy based on multiple relations. Apart from the $r_1$ discrepancy (marked red) that is considered by existing methods, we enhance identical hypotheses consistency by $r_2$ discrepancy (marked blue) and non-identical hypotheses discrimination by $r_3$ discrepancy (marked gray).
  • Figure 3: Pipeline of supervised 2D HPE and the source-pretraining process of our method. After passing through the feature extractor $G$ and the regressor $F$, each source image is mapped to $K$ heatmaps, one per keypoint. After certain transforms, $K$ 2D keypoints are obtained. We use these heatmaps and coordinates to compute $\mathcal{L}_{mse}$ and $\mathcal{L}_{oks}$.
  • Figure 4: Comparison between (a) the conventional structure and (b) our proposed Intermediate Domain Framework (IDF). Details are provided in Section \ref{sec:net}. (Best viewed in color and zoomed in.)
  • Figure 5: Proposed discrepancy loss. Here the keypoint with a star is used for illustration. Keypoints connected with a cyan arrow are inter-hypothesis identical ($r_1$ relation). $L_{r1}$ in Eq. \ref{eq:r123} is used to describe their discrepancy. Keypoints marked with blue arrows are intra-hypothesis non-identical ($r_2$ relation) and represented with $L_{r2}$ in Eq. \ref{eq:r123}. Keypoints linked with red arrows are inter-hypothesis non-identical ($r_3$ relation) computed by $L_{r3}$ in Eq. \ref{eq:r123}.
  • ...and 9 more figures
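
The supervised pipeline described in the Figure 3 caption (heatmaps, then 2D keypoints, then $\mathcal{L}_{mse}$ and $\mathcal{L}_{oks}$) can be sketched roughly as follows. This is an illustrative toy, not the paper's code. The argmax decoding stands in for the unspecified "certain transforms", the similarity follows the standard COCO-style object keypoint similarity with all keypoints treated as visible, and `1 - OKS` as the loss form, the `scale` and `kappas` values, and all function names are assumptions.

```python
import math

def decode_heatmap(heatmap):
    """Return the (x, y) grid location of the largest value in a 2D heatmap."""
    best_xy, best_val = (0, 0), float("-inf")
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_xy, best_val = (x, y), val
    return best_xy

def heatmap_mse(pred, target):
    """Mean squared error over all cells of two same-sized heatmaps."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)

def oks(pred_pts, gt_pts, scale, kappas):
    """COCO-style object keypoint similarity, assuming every keypoint is visible."""
    total = 0.0
    for (px, py), (gx, gy), k in zip(pred_pts, gt_pts, kappas):
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * scale ** 2 * k ** 2))
    return total / len(gt_pts)

# Toy example with K = 2 keypoints on 3x3 heatmaps (values are made up).
gt_heatmaps = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],      # peak at (x=1, y=1)
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],      # peak at (x=2, y=0)
]
pred_heatmaps = [
    [[0, 0, 0], [0, 0.9, 0.1], [0, 0, 0]],  # peak matches the ground truth
    [[0, 0.8, 0.2], [0, 0, 0], [0, 0, 0]],  # peak is one cell to the left
]

gt_pts = [decode_heatmap(h) for h in gt_heatmaps]
pred_pts = [decode_heatmap(h) for h in pred_heatmaps]

mse = sum(heatmap_mse(p, t) for p, t in zip(pred_heatmaps, gt_heatmaps)) / len(gt_heatmaps)
score = oks(pred_pts, gt_pts, scale=3.0, kappas=[0.5, 0.5])
oks_loss = 1.0 - score  # assumed loss form: higher similarity means lower loss
print(pred_pts, gt_pts, mse, oks_loss)
```

The heatmap MSE penalizes per-pixel differences, while the OKS term scores the decoded coordinates directly, so the two losses supervise the same prediction at different levels.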