Toward Efficient Generalization in 3D Human Pose Estimation via a Canonical Domain Approach

Hoosang Lee, Jeha Ryu

TL;DR

The paper tackles the persistent challenge of domain gaps in 3D human pose estimation (HPE), which degrade cross-dataset generalization and typically require data augmentation or target-domain fine-tuning. It introduces a canonical-domain framework that maps both source and target data into a unified canonical space, enabling the lifting network to generalize without target-domain adaptation. Canonicalization rotates 3D poses to align with the camera's principal axis and centers 2D poses in the image plane, ensuring 2D-3D pose consistency and simplifying the learning task; target 2D poses are canonicalized at test time using perspective projection and known intrinsics. Empirical results across Human3.6M, Fit3D, and MPI-INF-3DHP with multiple lifting networks (e.g., DSTformer) show improved cross-dataset generalization and data efficiency, with competitive MPJPE and superior AUC relative to domain-generalization and domain-adaptation baselines. The work suggests that canonical-domain training plus test-time 2D canonicalization can reduce the computational burden of domain adaptation while maintaining or improving accuracy, and that it could be combined with data augmentation strategies in future work.

Abstract

Recent advancements in deep learning methods have significantly improved the performance of 3D Human Pose Estimation (HPE). However, performance degradation caused by domain gaps between source and target domains remains a major challenge to generalization, necessitating extensive data augmentation and/or fine-tuning for each specific target domain. To address this issue more efficiently, we propose a novel canonical domain approach that maps both the source and target domains into a unified canonical domain, alleviating the need for additional fine-tuning in the target domain. To construct the canonical domain, we introduce a canonicalization process to generate a novel canonical 2D-3D pose mapping that ensures 2D-3D pose consistency and simplifies 2D-3D pose patterns, enabling more efficient training of lifting networks. The canonicalization of both domains is achieved through the following steps: (1) in the source domain, the lifting network is trained within the canonical domain; (2) in the target domain, input 2D poses are canonicalized prior to inference by leveraging the properties of perspective projection and known camera intrinsics. Consequently, the trained network can be directly applied to the target domain without requiring additional fine-tuning. Experiments conducted with various lifting networks and publicly available datasets (e.g., Human3.6M, Fit3D, MPI-INF-3DHP) demonstrate that the proposed method substantially improves generalization capability across datasets while using the same data volume.

Paper Structure

This paper contains 27 sections, 23 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Upper: x-z plane in camera space (top view); Lower: image plane. (a) Global Position vs. Local Pose Domain Gap: Camera-relative 3D poses exhibiting the same posture can result in a domain gap due to global position. This gap is characterized by variations in the scale and position of their corresponding 2D poses. Conversely, 3D poses at a single camera-relative position but with differing postures lead to a domain gap caused by local pose. In this case, the scale and position remain consistent, but the 2D postures vary. (b) Canonicalization of 2D-3D Pose Pairs: Using the proposed method, each 2D-3D pose pair is canonicalized. The root joints of the 3D poses are aligned with the camera’s principal axis, while the 2D poses are repositioned to the image center. This canonicalization preserves variations in both scale and posture.
  • Figure 2: Example of 2D-3D inconsistency: (a) Two 3D poses with identical postures located at different positions (Position 1 and Position 2) in camera space. (b) Corresponding 2D poses projected onto the image plane, illustrating differences in posture shape. (c) An example of relative rotation caused by perspective projection.
  • Figure 3: Comparison between conventional and canonical 2D-3D mapping. (a) Proposed Canonicalization Process for 3D Pose, (b) Conventional 2D-3D Mapping, (c) Proposed Canonical 2D-3D Pose Mapping.
  • Figure 4: Comparison between Original and Canonical 2D-3D Data Distribution. The (root-relative) 3D pose distributions represent the overall 3D joint positions of the entire dataset in the x-y plane (top view) of the camera frame. The 2D pose distributions depict the overall 2D joint positions of the entire dataset in the image plane.
  • Figure 5: 2D Canonicalization and Inference Process: (1) Target 2D poses are transformed into the normalized image plane by $K_{target}^{-1}$; (2) the resulting normalized target poses are rotated by $R_{canon}$, which aligns the pelvis vector $[p_x, p_y, 1]$ with the principal axis; (3) the rotated poses are reprojected to the image plane by $K_{target}$; (4) the lifting network predicts the 3D poses from the canonical 2D poses; and (5) the predicted 3D poses are back-transformed by $R_{canon}^{-1}$ for comparison with the ground truth.
  • ...and 1 more figures
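
The test-time procedure described in the Figure 5 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `root_idx` parameter (pelvis assumed to be joint 0), the use of Rodrigues' formula to build $R_{canon}$, and an ideal pinhole camera without lens distortion are all choices made here for illustration. The lifting network of step (4) is omitted.

```python
import numpy as np


def rotation_to_principal_axis(v):
    """Rotation R with R @ (v / |v|) = [0, 0, 1], via Rodrigues' formula.

    Assumes v has a positive z-component, as the pelvis ray [p_x, p_y, 1] does.
    """
    a = np.asarray(v, dtype=float)
    a = a / np.linalg.norm(a)
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(a, z)
    sin_t = np.linalg.norm(axis)
    cos_t = float(a @ z)
    if sin_t < 1e-12:
        # The pelvis ray already lies on the principal axis.
        return np.eye(3)
    k = axis / sin_t
    k_cross = np.array([[0.0, -k[2], k[1]],
                        [k[2], 0.0, -k[0]],
                        [-k[1], k[0], 0.0]])
    return np.eye(3) + sin_t * k_cross + (1.0 - cos_t) * (k_cross @ k_cross)


def canonicalize_2d(pose_2d, K, root_idx=0):
    """Steps (1)-(3): pixel coordinates (J, 2) -> canonical pixel coordinates (J, 2)."""
    pose_2d = np.asarray(pose_2d, dtype=float)
    homog = np.hstack([pose_2d, np.ones((pose_2d.shape[0], 1))])
    rays = homog @ np.linalg.inv(K).T          # (1) normalized image plane, z = 1
    R = rotation_to_principal_axis(rays[root_idx])
    rotated = rays @ R.T                       # (2) rotate every joint ray by R
    rotated = rotated / rotated[:, 2:3]        # rescale back onto the z = 1 plane
    canon_2d = (rotated @ K.T)[:, :2]          # (3) reproject with the same K
    return canon_2d, R


def back_transform_3d(pose_3d_canon, R):
    """Step (5): apply R^{-1} = R^T to each predicted joint (row vectors)."""
    return np.asarray(pose_3d_canon, dtype=float) @ R


# Hypothetical intrinsics and a three-joint 2D pose, for illustration only.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
pose_2d = np.array([[900.0, 500.0], [920.0, 400.0], [880.0, 620.0]])
canon_2d, R_canon = canonicalize_2d(pose_2d, K)
print(canon_2d[0])  # the pelvis should land at the principal point (c_x, c_y)
```

In this sketch the rotation is applied to joint rays rather than to a translated 2D pose, so the canonical 2D pose is exactly the perspective projection of the 3D pose rotated by the same $R_{canon}$; a plain 2D shift to the image center would not have that property.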