Hi5: Synthetic Data for Inclusive, Robust, Hand Pose Estimation
Masum Hasan, Cengiz Ozel, Nina Long, Alexander Martin, Samuel Potter, Tariq Adnan, Sangwu Lee, Ehsan Hoque
TL;DR
Hi5 addresses the shortage of diverse, richly labeled hand pose data for affective computing. It introduces a scalable synthetic data pipeline that renders 583,000 labeled hand images from high-fidelity 3D models spanning diverse skin tones, genders, and expressive gestures. The authors combine automatic pose labeling via invisible markers with a broad diversity strategy (skin tones, HDRI lighting, and pose interpolation). They validate the approach by training ViTPose-based estimators on Hi5 variants, which perform competitively with real-data baselines and are more robust to occlusion and demographic variation. The key finding is that synthetic data can closely match real-data performance on pose estimation benchmarks, particularly under challenging conditions, while providing fairer representation across skin tones. The work demonstrates the practical viability of synthetic data for emotion-aware gesture recognition and releases open-source tools to support research in inclusive, expressive hand pose estimation.
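The TL;DR mentions automatic pose labeling via invisible markers but does not describe how a marker becomes a 2D keypoint label. The sketch below is a minimal illustration built on several assumptions: each marker is a 3D point attached to a hand joint, and the render camera follows a standard pinhole model with known intrinsics (`fx`, `fy`, `cx`, `cy`) and extrinsics (`R`, `t`). Under those assumptions, each marker can be projected into pixel coordinates and flagged according to whether it lands inside the frame. The function name `project_markers` and all of its parameters are hypothetical and are not taken from the Hi5 release. The sketch also ignores lens distortion and engine-specific coordinate conventions.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_markers(
    points_world: Sequence[Vec3],
    R: Sequence[Sequence[float]],
    t: Vec3,
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[Tuple[float, float, bool]]:
    """Hypothetical helper: project 3D marker positions to 2D pixel keypoints.

    Assumes an OpenCV-style camera frame (x right, y down, z forward) and the
    world-to-camera transform X_cam = R @ X_world + t. Returns (u, v, in_frame)
    per marker; in_frame only checks image bounds, not occlusion.
    """
    labels = []
    for p in points_world:
        # World -> camera coordinates.
        xc, yc, zc = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        if zc <= 1e-9:
            # Marker is behind the camera: no valid pixel location.
            labels.append((float("nan"), float("nan"), False))
            continue
        # Perspective division, then apply focal lengths and principal point.
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        labels.append((u, v, 0.0 <= u < width and 0.0 <= v < height))
    return labels
```

Game engines often use different handedness and axis conventions, such as a left-handed, y-up world. A real pipeline would therefore need to convert coordinates into the assumed camera frame before applying a projection like this one.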
Abstract
Hand pose estimation plays a vital role in capturing subtle nonverbal cues essential for understanding human affect. However, collecting diverse, expressive real-world data remains challenging: manual annotation is labor-intensive, and the resulting datasets often underrepresent demographic diversity and natural expressions. To address this issue, we introduce a cost-effective approach to generating synthetic data using high-fidelity 3D hand models and a wide range of affective hand poses. Our method incorporates varied skin tones, genders, dynamic environments, realistic lighting conditions, and diverse, naturally occurring gesture animations. The resulting dataset, Hi5, contains 583,000 pose-annotated images, carefully balanced to reflect natural diversity and emotional expressiveness. Models trained exclusively on Hi5 achieve performance comparable to models trained on human-annotated datasets. They also show superior robustness to occlusions and consistent accuracy across diverse skin tones, both of which are crucial for reliably recognizing expressive gestures in affective computing applications. Our results demonstrate that synthetic data effectively addresses critical limitations of existing datasets, enabling more inclusive, expressive, and reliable gesture recognition systems while achieving competitive performance on pose estimation benchmarks. The Hi5 dataset, data synthesis pipeline, source code, and game engine project are publicly released to support further research in synthetic hand-gesture applications.
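The abstract states that the images are carefully balanced across skin tones, genders, environments, lighting, and gesture animations, but it does not specify the sampling scheme. One simple way to achieve that kind of balance is stratified assignment. The approach cycles through shuffled copies of the full Cartesian product of scene factors, so that every combination is rendered about equally often. This is an assumption offered for illustration, not the authors' procedure. The function `balanced_render_configs` and the factor names and levels in the example are hypothetical placeholders.

```python
import itertools
import random
from collections import Counter
from typing import Dict, List, Sequence


def balanced_render_configs(
    n_images: int, factors: Dict[str, Sequence[str]], seed: int = 0
) -> List[Dict[str, str]]:
    """Hypothetical helper: assign one scene configuration per image.

    Every full combination of factor levels appears either floor(n/k) or
    ceil(n/k) times, where k is the number of combinations.
    """
    rng = random.Random(seed)
    names = list(factors)
    combos = list(itertools.product(*(factors[name] for name in names)))
    if n_images > 0 and not combos:
        raise ValueError("every factor needs at least one level")
    configs: List[Dict[str, str]] = []
    while len(configs) < n_images:
        block = combos[:]
        rng.shuffle(block)  # randomize order within each full pass
        for combo in block[: n_images - len(configs)]:
            configs.append(dict(zip(names, combo)))
    return configs


if __name__ == "__main__":
    # Hypothetical factor levels, for illustration only.
    factors = {
        "skin_tone": ["tone_1", "tone_2", "tone_3"],
        "gender": ["female", "male"],
        "hdri": ["studio", "outdoor"],
    }
    configs = balanced_render_configs(24, factors, seed=1)
    print(Counter(c["skin_tone"] for c in configs))
```

Full-product cycling guarantees that the counts for any two combinations differ by at most one. The drawback is that the number of combinations grows multiplicatively with each added factor. When there are many factors or levels, sampling each factor independently and uniformly is a cheaper alternative, though it balances each factor only in expectation.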
