Boosting Gesture Recognition with an Automatic Gesture Annotation Framework

Junxiao Shen, Xuhai Xu, Ran Tan, Amy Karlson, Evan Strasnick

TL;DR

This framework consists of a novel annotation model that leverages the Connectionist Temporal Classification (CTC) loss, and a semi-supervised learning pipeline that enables the model to improve its performance by training on its own predictions, known as pseudo labels.

Abstract

Training a real-time gesture recognition model heavily relies on annotated data. However, manual data annotation is costly and demands substantial human effort. In order to address this challenge, we propose a framework that can automatically annotate gesture classes and identify their temporal ranges. Our framework consists of two key components: (1) a novel annotation model that leverages the Connectionist Temporal Classification (CTC) loss, and (2) a semi-supervised learning pipeline that enables the model to improve its performance by training on its own predictions, known as pseudo labels. These high-quality pseudo labels can also be used to enhance the accuracy of other downstream gesture recognition models. To evaluate our framework, we conducted experiments using two publicly available gesture datasets. Our ablation study demonstrates that our annotation model design surpasses the baseline in terms of both gesture classification accuracy (3-4% improvement) and localization accuracy (71-75% improvement). Additionally, we illustrate that the pseudo-labeled dataset produced from the proposed framework significantly boosts the accuracy of a pre-trained downstream gesture recognition model by 11-18%. We believe that this annotation framework has immense potential to improve the training of downstream gesture recognition models using unlabeled datasets.
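
The pseudo-labeling idea described above (train, annotate unlabeled data, retrain on the model's own predictions) can be sketched with a toy example. This is a minimal illustration, not the authors' pipeline: the nearest-centroid "model", the scalar features, the margin-based confidence filter, and the names `fit_centroids`, `predict`, `self_train`, and `min_margin` are all assumptions made for this sketch.

```python
from statistics import mean

def fit_centroids(samples):
    """Fit a toy nearest-centroid 'model': one mean feature value per class."""
    by_class = {}
    for x, y in samples:
        by_class.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_class.items()}

def predict(centroids, x):
    """Return (label, margin). Needs at least two classes.

    The margin (distance to the runner-up centroid minus distance to the
    best one) is a crude, assumed stand-in for model confidence.
    """
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin

def self_train(labeled, unlabeled, min_margin=1.0, rounds=3):
    """Hypothetical loop: train, pseudo-label, keep confident labels, retrain."""
    train = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(train)
        confident, rest = [], []
        for x in pool:
            label, margin = predict(centroids, x)
            (confident if margin >= min_margin else rest).append((x, label))
        if not confident:
            break  # nothing new to learn from
        train.extend(confident)          # pseudo labels join the training set
        pool = [x for x, _ in rest]      # low-confidence samples stay unlabeled
    return fit_centroids(train), train, pool

centroids, train, pool = self_train(
    labeled=[(0.0, "swipe"), (10.0, "pinch")],
    unlabeled=[1.0, 9.0, 5.2],
)
```

In this toy run the two clear-cut samples are pseudo-labeled and added to the training set, while the ambiguous sample near the class boundary stays unlabeled. A real system would replace the centroid model with the annotation network and would need its own confidence criterion.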

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Our framework provides an automatic gesture annotation solution. The red borders highlight our main contribution. The framework consists of two components: (1) a novel annotation model that uses the Connectionist Temporal Classification (CTC) loss (see Figure 2), and (2) a semi-supervised pipeline that improves the model's performance by training on its own predictions, i.e., pseudo labels. In a real-world (open-world) setting, the gesture annotation framework operates as follows. First, a real-time gesture recognition model and our proposed annotation model are pre-trained on a small labeled dataset. Next, the annotation model is integrated into the pseudo-labeling process, where it produces pseudo labels by annotating an unlabeled dataset; the pseudo-labeled data is then used to augment the annotation model's training data. After pseudo-labeling is complete, the final high-quality pseudo labels are used to fine-tune the pre-trained real-time gesture recognition model.
  • Figure 2: Because the annotation model uses an extended input window, the true label for a window can contain multiple gestures. In the illustrated case, the true labels are {1, -, 3, -, 2}, where "-" represents background activity (no gesture). The output of the annotation model is a sequence of predicted labels with the same length as the input. The model is trained with a Connectionist Temporal Classification (CTC) loss. Its backbone is an A-ResLSTM adopted from Shen et al. [Shen2022Gesture].
  • Figure 3: Output probabilities of a CTC-trained network and a CE (cross-entropy)-trained network versus the ground truth [vandersteegen2020low]. The spikes produced by the CTC-trained network clearly capture the gesture nucleus, while the curve produced by the CE-trained network needs post-processing. For example, the CTC loss successfully separates two consecutive instances of the same gesture, while the CE loss merges them into one.
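
The behavior described for Figure 3 follows from the standard CTC collapsing rule: consecutive repeated labels are merged, then blanks are dropped. The sketch below applies that rule to per-frame argmax labels and also records the frame range of each spike. It is an illustration only, not the authors' decoding code; the function name `ctc_greedy_decode`, the use of "-" as the blank symbol (borrowed from the Figure 2 notation), and reading spike positions as gesture locations are assumptions.

```python
BLANK = "-"  # assumed blank/background symbol, matching the "-" in Figure 2

def ctc_greedy_decode(frame_labels):
    """Collapse per-frame labels into (label, start_frame, end_frame) segments.

    Standard CTC collapsing: merge consecutive repeats, then drop blanks.
    A blank between two identical labels keeps them as separate gestures.
    """
    segments = []
    prev = BLANK
    for t, label in enumerate(frame_labels):
        if label != BLANK:
            if label == prev:
                # Same non-blank label continues: extend the current segment.
                seg_label, start, _ = segments[-1]
                segments[-1] = (seg_label, start, t)
            else:
                # A new gesture starts at frame t.
                segments.append((label, t, t))
        prev = label
    return segments

# Per-frame argmax output for a window containing gesture 1 twice, then gesture 3.
frames = ["-", "1", "1", "-", "1", "-", "3", "-"]
segments = ctc_greedy_decode(frames)
labels = [label for label, _, _ in segments]
```

Here `labels` contains gesture 1 twice because a blank frame separates the two spikes, which is the property the caption attributes to the CTC-trained network. A frame-wise CE output without a blank class has no such separator, so back-to-back repeats need extra post-processing to split.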