Self-supervised Optimization of Hand Pose Estimation using Anatomical Features and Iterative Learning
Christian Jauch, Timo Leitritz, Marco F. Huber
TL;DR
The paper tackles the challenge of robust hand pose-based activity recognition in manual assembly, where gloves and domain shifts degrade pose estimates. It proposes a self-supervised pipeline that generates an application-specific dataset from unlabeled video by detecting hands, enforcing anatomical and contextual constraints, and applying temporal smoothing, then retrains both the hand detector and the pose estimator in an iterative loop (up to three iterations). Evaluations on HanCo for parameterization and on unlabeled Fraunhofer IPA assembly videos show that retraining improves detection, pose estimation, and downstream activity recognition with PoseConv3D, achieving a top-1 accuracy of $0.750 \pm 0.016$ and a mean class accuracy of $0.738 \pm 0.022$ in glove-wearing scenarios. The approach demonstrates a practical pathway to cheap, robust, application-specific hand pose systems for manufacturing, reducing dependence on large labeled datasets and enabling closer human-centered interaction with assembly processes.
Abstract
Manual assembly workers face increasing complexity in their work. Human-centered assistance systems could help, but object recognition as an enabling technology hinders sophisticated human-centered design of these systems. At the same time, activity recognition based on hand poses suffers from poor pose estimation in complex usage scenarios, such as when workers wear gloves. This paper presents a self-supervised pipeline for adapting hand pose estimation to specific use cases with minimal human interaction. This enables cheap and robust hand pose-based activity recognition. The pipeline consists of a general machine learning model for hand pose estimation trained on a generalized dataset, spatial and temporal filtering to account for anatomical constraints of the hand, and a retraining step to improve the model. Different parameter combinations are evaluated on a publicly available and annotated dataset. The best parameter and model combination is then applied to unlabeled videos from a manual assembly scenario. The effectiveness of the pipeline is demonstrated by training an activity recognition model as a downstream task in the manual assembly scenario.
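
The spatial and temporal filtering step can be illustrated with a minimal sketch. It is not the authors' implementation: the 21-keypoint layout, the bone-length ratio test, the thresholds, the moving-average window, and all function names (`is_anatomically_plausible`, `smooth_track`, `select_training_frames`) are assumptions chosen for illustration. The idea is to smooth a track of predicted hand poses over time and keep only frames whose bone lengths are plausible relative to the palm, so that the surviving frames could serve as pseudo-labels for retraining.

```python
import numpy as np

# Assumed 21-keypoint layout: wrist = 0, then four joints per finger.
# The layout used in the paper may differ.
FINGERS = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12),
           (13, 14, 15, 16), (17, 18, 19, 20)]
# Bones: wrist to the first joint of each finger, then consecutive joints.
BONES = [(0, f[0]) for f in FINGERS] + \
        [(f[i], f[i + 1]) for f in FINGERS for i in range(3)]


def is_anatomically_plausible(pose, min_ratio=0.1, max_ratio=1.5):
    """Spatial filter (illustrative): every bone length, divided by the
    wrist-to-middle-finger-base distance, must lie in [min_ratio, max_ratio]."""
    pose = np.asarray(pose, dtype=float)
    ref = np.linalg.norm(pose[9] - pose[0])
    if ref < 1e-6:  # degenerate hand, no usable scale
        return False
    ratios = np.array([np.linalg.norm(pose[b] - pose[a]) for a, b in BONES]) / ref
    return bool(np.all((ratios >= min_ratio) & (ratios <= max_ratio)))


def smooth_track(poses, window=5):
    """Temporal filter (illustrative): centered moving average over frames,
    applied independently to each keypoint coordinate."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    poses = np.asarray(poses, dtype=float)  # shape (T, 21, 2)
    half = window // 2
    # Repeat the first and last frame so the output keeps T frames.
    padded = np.pad(poses, ((half, half), (0, 0), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="valid"), 0, padded)


def select_training_frames(track, window=5):
    """Smooth a track, then keep only frames that pass the spatial check.
    Returns the kept poses and the boolean mask over the original frames."""
    smoothed = smooth_track(track, window)
    keep = np.array([is_anatomically_plausible(p) for p in smoothed])
    return smoothed[keep], keep
```

In this sketch a single outlier frame also causes its temporal neighbors to be rejected, because the moving average spreads the error across the window. That is a deliberately conservative choice for pseudo-labels; a median filter or a per-frame check before smoothing would be equally reasonable alternatives.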
