Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation

Zhuoran Zhao, Linlin Yang, Pengzhan Sun, Pan Hui, Angela Yao

TL;DR

This work systematically analyzes the synthetic-to-real gap in 3D hand pose estimation by decomposing synthetic data into hand components, occlusions, and skeleton topology, and by introducing a high-quality synthesis pipeline based on the NIMBLE hand model. It shows that synthetic data can match real-data performance when arms, diverse textures, backgrounds, pose coverage, and object occlusions are properly incorporated, and that mixing synthetic with real data improves in-domain and cross-domain generalization. Key contributions include a compositional rendering approach with arm/object occlusion priors, an amplitude spectrum augmentation strategy, and an examination of topology differences between synthetic and real hand templates. The findings enable scalable synthetic-data-driven hand pose estimation and provide practical guidance for reducing domain gaps in related vision tasks.

Abstract

Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits on photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, there is still a large synthetic-to-real gap. This paper presents the first systematic study of the synthetic-to-real gap of 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when integrating our identified components, paving the path to use synthetic data alone for hand pose estimation. Code and data are available at: https://github.com/delaprada/HandSynthesis.git.
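
The abstract names image frequency statistics as one component of the gap, and the TL;DR mentions an amplitude spectrum augmentation strategy. The summary does not spell out how that strategy works, so the following is only a minimal sketch of one common Fourier-domain approach: blend the amplitude spectrum of a synthetic image with that of a real image while keeping the synthetic phase. The function name `mix_amplitude` and the blending weight `alpha` are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def mix_amplitude(synthetic, real, alpha=0.5):
    """Blend Fourier amplitudes of two same-sized images, keeping the synthetic phase.

    synthetic, real: arrays of shape (H, W) or (H, W, C) with values in [0, 255].
    alpha: 0 keeps the synthetic amplitude, 1 fully adopts the real amplitude.
    NOTE: illustrative assumption, not the paper's exact augmentation.
    """
    syn = np.asarray(synthetic, dtype=np.float64)
    ref = np.asarray(real, dtype=np.float64)
    if syn.shape != ref.shape:
        raise ValueError("images must have the same shape")

    # 2D FFT over the spatial axes; colour channels are transformed independently.
    f_syn = np.fft.fft2(syn, axes=(0, 1))
    f_ref = np.fft.fft2(ref, axes=(0, 1))

    # Amplitude carries frequency statistics; phase carries spatial layout.
    amp_syn, phase_syn = np.abs(f_syn), np.angle(f_syn)
    amp_ref = np.abs(f_ref)

    # Linear blend of the two amplitude spectra.
    amp_mix = (1.0 - alpha) * amp_syn + alpha * amp_ref

    # Recombine with the synthetic phase and return to image space.
    mixed = np.fft.ifft2(amp_mix * np.exp(1j * phase_syn), axes=(0, 1)).real
    return np.clip(mixed, 0.0, 255.0)
```

Because only the amplitude is altered, the hand's spatial structure (and therefore its pose annotation) is unchanged, which is why this family of augmentations is attractive for pose estimation. With `alpha=0` the function returns the synthetic image unchanged up to floating-point error.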

Paper Structure

This paper contains 29 sections, 6 equations, 17 figures, 11 tables, and 1 algorithm.

Figures (17)

  • Figure 1: We present a systematic study of the synthetic-to-real gap for 3D hand pose estimation by decomposing the synthetic data to establish associations between hand image components and predictions.
  • Figure 1: Finger-level occlusion annotation preparation.
  • Figure 2: (a) Real images (blue) have richer hand texture information and more complex environments than synthetic images (red). Arms and interacting objects are often present in real images. (b) Real datasets have higher variation in amplitude values than synthetic ones across all frequency bands.
  • Figure 2: Hand skeleton topology difference between the NIMBLE template and the MANO template. Green dots denote NIMBLE joints and red dots denote MANO joints.
  • Figure 3: Image synthesis process.
  • ...and 12 more figures
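
The observation in the first Figure 2 caption, part (b), compares amplitude values across frequency bands. A rough way to measure such a statistic is sketched below, assuming grayscale images and equal-width radial bands in the centred Fourier spectrum. The band definition and the helper names `band_amplitudes` and `band_variation` are assumptions made for illustration and may differ from the paper's analysis.

```python
import numpy as np

def band_amplitudes(gray, n_bands=4):
    """Mean Fourier amplitude of a grayscale image in n_bands radial frequency bands."""
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    # Centre the zero-frequency (DC) term at index (h // 2, w // 2).
    amp = np.abs(np.fft.fftshift(np.fft.fft2(gray)))

    # Distance of every frequency bin from the DC term.
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)

    # Equal-width radial bands; digitizing against the interior edges
    # yields band ids 0 .. n_bands - 1 (low to high frequency).
    edges = np.linspace(0.0, radius.max(), n_bands + 1)
    band = np.digitize(radius, edges[1:-1])
    return np.array([amp[band == b].mean() for b in range(n_bands)])

def band_variation(images, n_bands=4):
    """Per-band standard deviation of mean amplitude across a set of images."""
    stats = np.stack([band_amplitudes(img, n_bands) for img in images])
    return stats.std(axis=0)
```

Running `band_variation` on a sample of real images and on a sample of synthetic images, then comparing the two vectors band by band, gives a simple numerical counterpart to the comparison the caption describes.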