Self-supervised Optimization of Hand Pose Estimation using Anatomical Features and Iterative Learning

Christian Jauch, Timo Leitritz, Marco F. Huber

TL;DR

The paper tackles the challenge of robust hand pose-based activity recognition in manual assembly, where gloves and domain shifts degrade pose estimates. It proposes a self-supervised pipeline that generates an application-specific dataset from unlabelled video by detecting hands, enforcing anatomical and contextual constraints, and applying temporal smoothing. The resulting dataset is used to retrain both the hand detector and the pose estimator in an iterative loop (up to three iterations). The pipeline parameters are selected on the annotated HanCo dataset, and the best configuration is then applied to unlabelled assembly videos recorded at Fraunhofer IPA. The results show that retraining improves detection, pose estimation, and downstream activity recognition with PoseConv3D, which reaches a top-1 accuracy of $0.750 \pm 0.016$ and a mean class accuracy of $0.738 \pm 0.022$ in glove-wearing scenarios. The approach offers a practical route to cheap, robust, application-specific hand pose systems for manufacturing that depend less on large labelled datasets and support closer human-centered interaction with assembly processes.

Abstract

Manual assembly workers face increasing complexity in their work. Human-centered assistance systems could help, but object recognition as an enabling technology hinders sophisticated human-centered design of these systems. At the same time, activity recognition based on hand poses suffers from poor pose estimation in complex usage scenarios, such as wearing gloves. This paper presents a self-supervised pipeline for adapting hand pose estimation to specific use cases with minimal human interaction. This enables cheap and robust hand pose-based activity recognition. The pipeline consists of a general machine learning model for hand pose estimation trained on a generalized dataset, spatial and temporal filtering to account for anatomical constraints of the hand, and a retraining step to improve the model. Different parameter combinations are evaluated on a publicly available, annotated dataset. The best parameter and model combination is then applied to unlabelled videos from a manual assembly scenario. The effectiveness of the pipeline is demonstrated by training an activity recognition model as a downstream task in the manual assembly scenario.
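
The spatial and temporal filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 21-keypoint layout, the bone-length ratio rule, the `max_ratio` threshold, and the moving-average window are all assumptions chosen for illustration, and the rule ignores foreshortening in 2D projections.

```python
import math

# Hypothetical 21-keypoint layout: wrist = 0, then four joints per finger.
# This is a common convention, not one confirmed by the summary above.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4),
         (0, 5), (5, 6), (6, 7), (7, 8),
         (0, 9), (9, 10), (10, 11), (11, 12),
         (0, 13), (13, 14), (14, 15), (15, 16),
         (0, 17), (17, 18), (18, 19), (19, 20)]


def is_anatomically_plausible(pose, max_ratio=4.0):
    """Spatial filter (assumed rule): reject a pose whose longest bone is
    more than max_ratio times as long as its shortest bone."""
    lengths = [math.dist(pose[a], pose[b]) for a, b in BONES]
    shortest = min(lengths)
    return shortest > 0.0 and max(lengths) / shortest <= max_ratio


def smooth_track(track, window=5):
    """Temporal filter: centered moving average over frames.
    The window shrinks at the start and end of the track."""
    half = window // 2
    smoothed = []
    for t in range(len(track)):
        frames = track[max(0, t - half): t + half + 1]
        smoothed.append([
            (sum(f[k][0] for f in frames) / len(frames),
             sum(f[k][1] for f in frames) / len(frames))
            for k in range(len(track[t]))
        ])
    return smoothed


def build_training_candidates(track, max_ratio=4.0, window=5):
    """Smooth a track, then keep only the frames that pass the spatial filter."""
    return [pose for pose in smooth_track(track, window)
            if is_anatomically_plausible(pose, max_ratio)]
```

In a loop like the one the paper describes, poses that survive such filters would serve as pseudo-labels for retraining the detector and the pose estimator, after which the filtering is repeated with the updated models.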

Paper Structure (17 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Examples of the manual assembly dataset recorded at Fraunhofer IPA. A simple lock is assembled by three different people, each working with three different types of gloves and with bare hands. Each person assembles the lock twice with each glove type and twice with bare hands.
  • Figure 2: Overview of the pipeline. a) Initial generation of hand pose dataset candidates. b) Filtering of candidates and generation of the dataset. c) Retraining of the hand detector with the dataset from step b). d) Retraining of the pose estimator with the dataset from step b). After steps c) and d), steps a) and b) are repeated with the updated models.
  • Figure 3: Precision and recall curves of the initial model candidates.
  • Figure 4: AUC over different keypoint thresholds of the initial model candidates.
  • Figure 5: Evaluation of different confidence thresholds at IoU = 0.75.
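
The metrics named in Figures 4 and 5 can be made concrete with a short sketch. It assumes that the AUC in Figure 4 is the area under a PCK curve (percentage of correct keypoints over error thresholds), which is common in hand pose evaluation but not confirmed by the captions. The function names and the `(x1, y1, x2, y2)` box format are illustrative choices.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pck(errors, threshold):
    """Fraction of keypoints whose error is within the threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


def pck_auc(errors, thresholds):
    """Trapezoidal area under the PCK curve, normalized to [0, 1].
    thresholds must be sorted in ascending order."""
    values = [pck(errors, t) for t in thresholds]
    area = sum(
        (thresholds[i + 1] - thresholds[i]) * (values[i] + values[i + 1]) / 2
        for i in range(len(thresholds) - 1)
    )
    return area / (thresholds[-1] - thresholds[0])
```

Under these definitions, a predicted hand box counts as a correct detection at IoU = 0.75 only if `box_iou(prediction, ground_truth) >= 0.75`; the confidence threshold then determines which predictions are considered at all.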