Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation
Zhuoran Zhao, Linlin Yang, Pengzhan Sun, Pan Hui, Angela Yao
TL;DR
This work systematically analyzes the synthetic-to-real gap in 3D hand pose estimation by decomposing synthetic data into hand components, occlusions, and skeleton topology, and by introducing a high-quality synthesis pipeline based on the NIMBLE hand model. It shows that synthetic data can match real-data performance when arms, diverse textures, backgrounds, pose coverage, and object occlusions are properly incorporated, and that mixing synthetic with real data improves in-domain and cross-domain generalization. Key contributions include a compositional rendering approach with arm/object occlusion priors, an amplitude spectrum augmentation strategy, and an examination of topology differences between synthetic and real hand templates. The findings enable scalable synthetic-data-driven hand pose estimation and provide practical guidance for reducing domain gaps in related vision tasks.
Abstract
Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits of photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, there is still a large synthetic-to-real gap. This paper presents the first systematic study of the synthetic-to-real gap in 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a pipeline for synthesizing high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when the identified components are integrated, paving the way for using synthetic data alone for hand pose estimation. Code and data are available at: https://github.com/delaprada/HandSynthesis.git.
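As an illustration of what adjusting image frequency statistics can look like, the sketch below blends the Fourier amplitude spectrum of a synthetic image toward that of a real reference image while keeping the synthetic image's phase. This is a generic amplitude-mixing scheme written under our own assumptions, not necessarily the augmentation used in this work; the function name `mix_amplitude_spectrum` and the parameter `alpha` are hypothetical.

```python
import numpy as np


def mix_amplitude_spectrum(src, ref, alpha=0.5):
    """Blend the Fourier amplitude of `src` toward `ref`, keeping the phase of `src`.

    Hypothetical helper for illustration only.
    src, ref: real-valued float arrays of identical shape (H, W, C).
    alpha:    0.0 returns `src` unchanged; 1.0 uses the amplitude of `ref` entirely.
    """
    if src.shape != ref.shape:
        raise ValueError("src and ref must have the same shape")

    # 2D FFT over the spatial axes, applied independently per channel.
    src_f = np.fft.fft2(src, axes=(0, 1))
    ref_f = np.fft.fft2(ref, axes=(0, 1))

    # Linear interpolation of amplitudes; the phase comes from the source image.
    amplitude = (1.0 - alpha) * np.abs(src_f) + alpha * np.abs(ref_f)
    phase = np.angle(src_f)

    mixed = amplitude * np.exp(1j * phase)
    # For real inputs the mixed spectrum stays conjugate-symmetric, so the
    # imaginary part of the inverse transform is only numerical noise.
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    synthetic = rng.random((64, 64, 3))  # stand-in for a rendered hand image
    real = rng.random((64, 64, 3))       # stand-in for a real photograph
    augmented = mix_amplitude_spectrum(synthetic, real, alpha=0.5)
    print(augmented.shape)
```

Because the phase carries most of the spatial structure, the hand layout and pose labels of the synthetic image remain valid after mixing, while low-level appearance statistics move toward the real image. Output values may fall slightly outside the input range and may need clipping before use.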
