Toward Efficient Generalization in 3D Human Pose Estimation via a Canonical Domain Approach
Hoosang Lee, Jeha Ryu
TL;DR
The paper tackles domain gaps in 3D human pose estimation (HPE), which degrade cross-dataset generalization and are usually handled with data augmentation or target-domain fine-tuning. It introduces a canonical-domain framework that maps both source and target data into a unified canonical space, so the lifting network generalizes without target-domain adaptation. Canonicalization rotates 3D poses to align with the camera's principal axis and centers 2D poses in the image plane, which ensures 2D-3D pose consistency and simplifies the learning task; at test time, target 2D poses are canonicalized using the properties of perspective projection and known camera intrinsics. Experiments on Human3.6M, Fit3D, and MPI-INF-3DHP with multiple lifting networks (including DSTformer) show improved cross-dataset generalization and data efficiency, with competitive MPJPE (mean per-joint position error) and superior AUC relative to domain-generalization and domain-adaptation baselines. The results suggest that canonical-domain training plus test-time 2D canonicalization can reduce the computational burden of domain adaptation while maintaining or improving accuracy, and that the approach could be combined with data augmentation strategies in future work.
Abstract
Recent advancements in deep learning methods have significantly improved the performance of 3D Human Pose Estimation (HPE). However, performance degradation caused by domain gaps between source and target domains remains a major challenge to generalization, necessitating extensive data augmentation and/or fine-tuning for each specific target domain. To address this issue more efficiently, we propose a novel canonical domain approach that maps both the source and target domains into a unified canonical domain, alleviating the need for additional fine-tuning in the target domain. To construct the canonical domain, we introduce a canonicalization process to generate a novel canonical 2D-3D pose mapping that ensures 2D-3D pose consistency and simplifies 2D-3D pose patterns, enabling more efficient training of lifting networks. The canonicalization of both domains is achieved through the following steps: (1) in the source domain, the lifting network is trained within the canonical domain; (2) in the target domain, input 2D poses are canonicalized prior to inference by leveraging the properties of perspective projection and known camera intrinsics. Consequently, the trained network can be directly applied to the target domain without requiring additional fine-tuning. Experiments conducted with various lifting networks and publicly available datasets (e.g., Human3.6M, Fit3D, MPI-INF-3DHP) demonstrate that the proposed method substantially improves generalization capability across datasets while using the same data volume.
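The two canonicalization steps described above can be illustrated with a minimal NumPy sketch. This is one plausible reading of the summary, not the authors' implementation: it assumes an ideal pinhole camera with intrinsic matrix K and no lens distortion, poses given in camera coordinates, and a root joint (e.g., the pelvis) stored at index 0. All function names are hypothetical. The 3D step rotates the pose about the camera center so that the root lies on the principal axis; the 2D step applies the same rotation to back-projected rays, which needs only the 2D keypoints and K.

```python
import numpy as np


def rotation_to_principal_axis(direction):
    """Rotation matrix that maps `direction` onto the camera's +z axis."""
    a = direction / np.linalg.norm(direction)
    b = np.array([0.0, 0.0, 1.0])
    c = float(a @ b)  # cosine of the angle between a and the +z axis
    if c <= 0.0:
        raise ValueError("root must lie in front of the camera (z > 0)")
    v = np.cross(a, b)  # rotation axis scaled by the sine of the angle
    vx = np.array([
        [0.0, -v[2], v[1]],
        [v[2], 0.0, -v[0]],
        [-v[1], v[0], 0.0],
    ])
    # Rodrigues-style formula for the rotation taking unit vector a to b.
    return np.eye(3) + vx + (vx @ vx) / (1.0 + c)


def project(pose3d, K):
    """Perspective projection of (J, 3) camera-space points to (J, 2) pixels."""
    proj = pose3d @ K.T
    return proj[:, :2] / proj[:, 2:3]


def canonicalize_3d(pose3d, root_idx=0):
    """Rotate a (J, 3) pose about the camera center so the root is on the z axis."""
    R = rotation_to_principal_axis(pose3d[root_idx])
    return pose3d @ R.T


def canonicalize_2d(pose2d, K, root_idx=0):
    """Canonicalize (J, 2) pixel keypoints using only K (no depth required)."""
    homog = np.hstack([pose2d, np.ones((len(pose2d), 1))])
    rays = homog @ np.linalg.inv(K).T           # back-project pixels to rays
    R = rotation_to_principal_axis(rays[root_idx])
    reproj = (rays @ R.T) @ K.T                 # rotate rays, then re-project
    return reproj[:, :2] / reproj[:, 2:3]
```

Under these assumptions, canonicalizing the 2D keypoints gives the same result as projecting the canonicalized 3D pose, and the root keypoint lands on the principal point (cx, cy). That agreement is the kind of 2D-3D consistency the abstract refers to, and it is why a network trained on canonical pairs could be applied to canonicalized target inputs without fine-tuning.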
