HaGRIDv2: 1M Images for Static and Dynamic Hand Gesture Recognition

Anton Nuzhdin, Alexander Nagaev, Alexander Sautin, Alexander Kapitanov, Karina Kvanchiani

TL;DR

HaGRIDv2 tackles the need for a comprehensive, large-scale hand gesture dataset suitable for both static and dynamic recognition in real-world HCI scenarios like video conferencing and home automation. It introduces 15 new static gestures and a diversified 'no gesture' class, plus an extended dynamic gesture recognition algorithm that supports swipes, zooms, clicks, and drag-and-drops, all while maintaining lightweight CPU-friendly inference. The work demonstrates improved cross-dataset generalization, stronger pre-training benefits, and enhanced gesture generation quality via diffusion models, supported by extensive ablations and cross-dataset evaluations. By releasing HaGRIDv2, pre-trained models, and the dynamic gesture algorithm, the study provides a practical, scalable resource for developing robust gesture-based interfaces on edge devices.

Abstract

This paper proposes HaGRIDv2, the second version of the widely used hand gesture recognition dataset HaGRID. We cover 15 new gestures with conversation and control functions, including two-handed ones. Building on the foundational concepts proposed by HaGRID's authors, we implemented the dynamic gesture recognition algorithm and further enhanced it by adding three new groups of manipulation gestures. The "no gesture" class was diversified by adding samples of natural hand movements, which allowed us to reduce false positives by a factor of 6. Combined with the original HaGRID samples, the resulting version outperforms the original for pre-training models on gesture-related tasks. It also achieves the best generalization ability among gesture and hand detection datasets and improves the quality of gestures generated by a diffusion model. HaGRIDv2, pre-trained models, and a dynamic gesture recognition algorithm are publicly available.

Paper Structure

This paper contains 21 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: The 15 new gesture classes (outlined in red) added to HaGRID's original 18 ("inv" stands for "inverted").
  • Figure 2: The key statistics of HaGRIDv2. (a) Image resolution distribution showing the scatter of image dimensions; (b-d) Distribution of subjects in the training, validation, and test sets, respectively; (e) Bounding box area distribution; (f) Brightness distribution; (g-i) Age and gender distributions of subjects, estimated automatically by the MiVOLO neural network; (j) Racial distribution of subjects, estimated automatically by the FairFace neural network.
  • Figure 3: Impact of pre-training on gesture classification and detection across HaGRID and HaGRIDv2. "Original" metrics are sourced from the respective dataset papers; missing values indicate metrics not reported by the authors.
  • Figure 4: Comparison of false positives on the "no gesture" class for the HaGRID and HaGRIDv2 datasets.
  • Figure 5: Screenshots from the dynamic gesture recognition demo. The bounding boxes highlight detected gestures with their class labels. Each dynamic gesture is marked according to its function: yellow arrows indicate swipe directions; green and blue circles represent drag and drop, respectively; "click" and "double-click" display their corresponding gestures; and green arrows or a stretchable blue rectangle indicate zoom gestures.
  • ...and 9 more figures