Hi5: Synthetic Data for Inclusive, Robust, Hand Pose Estimation

Masum Hasan, Cengiz Ozel, Nina Long, Alexander Martin, Samuel Potter, Tariq Adnan, Sangwu Lee, Ehsan Hoque

TL;DR

Hi5 addresses the shortage of diverse, richly labeled hand pose data for affective computing by introducing a scalable synthetic data pipeline that renders 583,000 labeled hand images using high-fidelity 3D models, diverse skin tones, genders, and expressive gestures. The authors couple automatic pose labeling via invisible markers with a broad diversity strategy (skin tones, HDRI lighting, and pose interpolation) and validate the approach by training ViTPose-based estimators on Hi5 variants, achieving competitive performance against real-data baselines and enhanced robustness to occlusion and demographic variation. Key findings show synthetic data can closely match real-data performance on pose estimation benchmarks, particularly under challenging conditions, while enabling fairer representation across skin tones. The work demonstrates the practical viability of synthetic data for emotion-aware gesture recognition and provides open-source tools to accelerate research in inclusive, expressive hand pose estimation.

Abstract

Hand pose estimation plays a vital role in capturing subtle nonverbal cues essential for understanding human affect. However, collecting diverse, expressive real-world data remains challenging due to labor-intensive manual annotation that often underrepresents demographic diversity and natural expressions. To address this issue, we introduce a cost-effective approach to generating synthetic data using high-fidelity 3D hand models and a wide range of affective hand poses. Our method includes varied skin tones, genders, dynamic environments, realistic lighting conditions, and diverse naturally occurring gesture animations. The resulting dataset, Hi5, contains 583,000 pose-annotated images, carefully balanced to reflect natural diversity and emotional expressiveness. Models trained exclusively on Hi5 achieve performance comparable to human-annotated datasets, exhibiting superior robustness to occlusions and consistent accuracy across diverse skin tones -- which is crucial for reliably recognizing expressive gestures in affective computing applications. Our results demonstrate that synthetic data effectively addresses critical limitations of existing datasets, enabling more inclusive, expressive, and reliable gesture recognition systems while achieving competitive performance in pose estimation benchmarks. The Hi5 dataset, data synthesis pipeline, source code, and game engine project are publicly released to support further research in synthetic hand-gesture applications.

Paper Structure (20 sections, 12 figures, 4 tables)

Figures (12)

  • Figure 1: a) Invisible marker objects (visualized with red dots) within a fully rigged 3D hand armature capable of making natural hand expressions, and b) the projection of a realistic 3D hand model into a 2D scene. This projection enables automatic, precise, and consistent pose labeling—a key step in generating synthetic data that captures natural hand configurations, even under occlusion.
  • Figure 2: Our synthetic data pipeline gives us precise control over skin color representation. This figure shows the female hand models used in Hi5, with Individual Typology Angle (ITA) skin tone values of -80, -30, 10, 28, 41, and 55, respectively, based on the dermatology literature (Chardon et al., 1991). The full ITA scale is shown in Appendix Figure \ref{fig:ita-range}.
  • Figure 3: Sample of different affective hand pose images from the Hi5 dataset with automatically generated pose labels. Affect labels are estimations, as different poses interpolate between each other.
  • Figure 4: Nearly identical hand poses in our dataset have a surprisingly diverse image representation. (Pose: Rocker)
  • Figure 5: Visual results of predictions by the ViTPose Small model trained on the Hi5 Large and OneHand10K datasets (best viewed in color and zoomed in).
  • ...and 7 more figures
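
The ITA values in the Figure 2 caption come from the standard Individual Typology Angle formula, ITA = arctan((L* - 50) / b*) × 180/π, computed from CIELAB lightness L* and yellow-blue component b*. The following is a minimal sketch of that computation, not the authors' code: the function names are hypothetical, the band labels follow the commonly cited ITA skin-type categories, and the treatment of values that fall exactly on a boundary is an assumption.

```python
import math

# Lower bounds (in degrees) of the commonly cited ITA skin-type bands.
# Treating each bound as exclusive is an assumption of this sketch.
ITA_BANDS = [
    (55.0, "very light"),
    (41.0, "light"),
    (28.0, "intermediate"),
    (10.0, "tan"),
    (-30.0, "brown"),
]


def ita_degrees(l_star, b_star):
    """Return the Individual Typology Angle in degrees from CIELAB L* and b*."""
    # atan2 matches arctan((L* - 50) / b*) whenever b* > 0 (typical for skin)
    # and avoids a division by zero when b* is 0.
    return math.degrees(math.atan2(l_star - 50.0, b_star))


def ita_category(ita):
    """Map an ITA value in degrees to a skin-type band label."""
    for lower_bound, name in ITA_BANDS:
        if ita > lower_bound:
            return name
    return "dark"


if __name__ == "__main__":
    # ITA values listed in the Figure 2 caption.
    for value in (-80, -30, 10, 28, 41, 55):
        print(value, ita_category(value))
```

The six values in the caption coincide with the band boundaries (plus one value deep in the darkest band), which suggests the hand models were chosen to span every category rather than to cluster around lighter tones.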