HaGRID - HAnd Gesture Recognition Image Dataset

Alexander Kapitanov, Karina Kvanchiani, Alexander Nagaev, Roman Kraynov, Andrei Makhliarchuk

TL;DR

HaGRID (HAnd Gesture Recognition Image Dataset) is a large dataset for building hand gesture recognition (HGR) systems that focus on controlling devices. The paper also demonstrates that it can be used for pretraining models in HGR tasks.

Abstract

This paper introduces HaGRID (HAnd Gesture Recognition Image Dataset), a large dataset for building hand gesture recognition (HGR) systems focused on controlling devices. For this reason, all 18 chosen gestures carry a semiotic function and can be interpreted as a specific action. Although the gestures are static, they were chosen specifically so that several dynamic gestures can be composed from them. This allows a trained model to recognize not only static gestures such as "like" and "stop" but also dynamic gestures such as "swipes" and "drag and drop". HaGRID contains 554,800 images with bounding box annotations and gesture labels for hand detection and gesture classification tasks. The low variability in context and subjects in other datasets motivated us to create a dataset without such limitations. Using crowdsourcing platforms, we collected samples recorded by 37,583 subjects in at least as many scenes, with subject-to-camera distances from 0.5 to 4 meters and various natural light conditions. We assessed the influence of these diversity characteristics in ablation experiments. We also demonstrate that HaGRID can be used for pretraining models in HGR tasks. HaGRID and the pretrained models are publicly available.

Paper Structure

This paper contains 17 sections, 1 equation, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: The 18 gesture classes included in HaGRID ("inv." is the abbreviation of "inverted").
  • Figure 2: Bounding box aggregation pipeline. Hard aggregation applies consistency checks to all markups before averaging them. If the checks fail, soft aggregation prepares the markups for a successful hard aggregation.
  • Figure 3: Image resolution, brightness, subject-to-camera distance, subjects, and class separability analysis. a) image resolution distribution: samples overlap with equal transparency, so density reveals quantity; the minimum dimension of 90% of the images is 1,080; b) subjects' devices: only smartphones, personal computers, and tablets were used for recording; c), d), e) image distribution by subjects in the train, validation, and test sets, respectively; f) subject-to-camera distance distribution: distance was estimated as the bounding box area relative to the whole image (the boxes occupy up to 16% of the image); g) brightness distribution: images were converted to grayscale, and the average pixel brightness was computed; h) distribution of subjects' countries; i) t-SNE plot of ResNet-18 features.
  • Figure 4: Screenshots from the dynamic gesture recognition demo: a) the "swipe right" gesture is recognized by detecting a consecutive pair of a left-rotated "stop inverted" and a right-rotated "stop"; b) "drag and drop" is recognized by detecting the sequence "palm", "fist", "palm".
  • Figure 5: Visualization of how dataset characteristics affect the training of accurate and resilient classifiers: a) sample amount, and diversity in b) subjects, c) lighting, and d) subject-to-camera distance. Solid lines correspond to models trained and tested on the HaGRID dataset, whereas the dotted line is the model pretrained on HaGRID, finetuned on OUHANDS, and tested on the OUHANDS test set. The F1-score of a ResNet-18 trained from scratch on OUHANDS is 60.6.
  • ...and 2 more figures
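
The Figure 4 caption describes dynamic gestures as sequences of static gesture detections. A minimal sketch of that idea follows, assuming a per-frame classifier that emits one label string per frame (or `None` when no hand is detected). The function names, the label spellings, and the rule of collapsing repeated frames are illustrative assumptions, not the authors' demo implementation.

```python
from typing import Iterable, List, Optional, Sequence

# Hypothetical label spellings; the caption only names the gestures
# "palm" and "fist" for the "drag and drop" sequence.
DRAG_AND_DROP = ("palm", "fist", "palm")


def collapse_repeats(labels: Iterable[Optional[str]]) -> List[str]:
    """Skip frames without a detection and merge runs of identical labels.

    Note: because empty frames are skipped, "palm, None, palm" collapses
    to a single "palm" (an assumption made for simplicity).
    """
    collapsed: List[str] = []
    for label in labels:
        if label is None:
            continue
        if not collapsed or collapsed[-1] != label:
            collapsed.append(label)
    return collapsed


def contains_pattern(labels: Iterable[Optional[str]],
                     pattern: Sequence[str]) -> bool:
    """Return True if the collapsed label stream contains `pattern`
    as a contiguous run of static gestures."""
    seq = collapse_repeats(labels)
    n = len(pattern)
    return any(tuple(seq[i:i + n]) == tuple(pattern)
               for i in range(len(seq) - n + 1))


if __name__ == "__main__":
    frames = ["palm", "palm", None, "fist", "fist", "palm"]
    print(contains_pattern(frames, DRAG_AND_DROP))
```

The "swipe right" gesture in the caption additionally depends on hand rotation, so a matcher along these lines would need rotation-aware labels from the detector; that part is not sketched here.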