Online hand gesture recognition using Continual Graph Transformers

Rim Slama, Wael Rabah, Hazem Wannous

TL;DR

This work tackles online, real-time hand gesture recognition from 3D hand skeleton sequences. It introduces CoSTrGCN, a hybrid architecture that first applies Spatial Graph Convolutional Networks to extract framewise spatial features and then a Transformer Graph Encoder to capture temporal dependencies, augmented by continual learning for streaming data. The authors report competitive performance on SHREC'21, with strong detection and Jaccard scores and few false positives, and discuss the practical implications for human-robot interaction and assistive technologies. The approach is notable for combining continual inference with graph-based transformers, which enables robust, low-latency online gesture recognition in dynamic environments.
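To make the Jaccard metric mentioned above concrete, here is a minimal sketch of a frame-level Jaccard index for one gesture class. It assumes per-frame integer labels for ground truth and prediction; the function name `frame_jaccard`, the toy label sequences, and the handling of a class absent from both sequences are illustrative assumptions, and the exact SHREC'21 evaluation protocol may differ.

```python
def frame_jaccard(truth, pred, label):
    """Jaccard index of one gesture class over per-frame labels.

    Illustrative helper (not from the paper): |GT ∩ Pred| / |GT ∪ Pred|,
    where each set holds the frame indices assigned to `label`.
    """
    if len(truth) != len(pred):
        raise ValueError("label sequences must have the same length")
    truth_frames = {i for i, y in enumerate(truth) if y == label}
    pred_frames = {i for i, y in enumerate(pred) if y == label}
    union = truth_frames | pred_frames
    if not union:
        # Assumption: a class absent from both sequences counts as agreement.
        return 1.0
    return len(truth_frames & pred_frames) / len(union)


# Toy stream of 10 frames; 0 = no gesture, 1 and 2 = gesture classes.
truth = [0, 0, 1, 1, 1, 1, 0, 0, 2, 2]
pred = [0, 1, 1, 1, 1, 0, 0, 0, 2, 0]

score_class_1 = frame_jaccard(truth, pred, 1)  # 3 shared frames / 5 in union
score_class_2 = frame_jaccard(truth, pred, 2)  # 1 shared frame / 2 in union
```

In this toy stream, a prediction that starts one frame early and ends one frame early still overlaps most of the gesture, which is why the score degrades gradually rather than dropping to zero.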

Abstract

Online continuous action recognition has emerged as a critical research area due to its practical implications in real-world applications, such as human-computer interaction, healthcare, and robotics. Among various modalities, skeleton-based approaches have gained significant popularity, demonstrating their effectiveness in capturing 3D temporal data while ensuring robustness to environmental variations. However, most existing works focus on segment-based recognition, making them unsuitable for real-time, continuous recognition scenarios. In this paper, we propose a novel online recognition system designed for real-time skeleton sequence streaming. Our approach leverages a hybrid architecture combining Spatial Graph Convolutional Networks (S-GCN) for spatial feature extraction and a Transformer-based Graph Encoder (TGE) for capturing temporal dependencies across frames. Additionally, we introduce a continual learning mechanism to enhance model adaptability to evolving data distributions, ensuring robust recognition in dynamic environments. We evaluate our method on the SHREC'21 benchmark dataset, demonstrating its superior performance in online hand gesture recognition. Our approach not only achieves state-of-the-art accuracy but also significantly reduces false positive rates, making it a compelling solution for real-time applications. The proposed system can be seamlessly integrated into various domains, including human-robot collaboration and assistive technologies, where natural and intuitive interaction is crucial.
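The two-stage design described in the abstract can be sketched in a few lines of plain Python: a graph convolution applied to each frame, followed by self-attention across frames. This is a toy illustration under stated assumptions, not the authors' implementation. The three-joint chain graph, the weight values, the mean pooling over joints, and the attention with identity query/key/value projections are all hypothetical simplifications, and the continual-inference mechanism is omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def spatial_gcn(frame, a_norm, weight):
    """One graph convolution on one frame: ReLU(A_norm @ X @ W)."""
    h = matmul(matmul(a_norm, frame), weight)
    return [[max(0.0, v) for v in row] for row in h]


def pool_joints(h):
    """Average joint features into one vector per frame (assumed pooling)."""
    return [sum(col) / len(h) for col in zip(*h)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames):
    """Single-head scaled dot-product self-attention across frames.

    Queries, keys, and values are the frame features themselves
    (identity projections), a simplification for illustration.
    """
    d = len(frames[0])
    out = []
    for q in frames:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames])
        out.append([sum(w * v[c] for w, v in zip(weights, frames)) for c in range(d)])
    return out


# Toy input: 3 frames, 3 joints in a chain (0-1-2), 3D coordinates per joint.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
sequence = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[0.0, 0.1, 0.0], [1.0, 0.2, 0.0], [2.0, 0.3, 0.0]],
    [[0.0, 0.2, 0.0], [1.0, 0.4, 0.0], [2.0, 0.6, 0.0]],
]
weight = [[0.5, -0.5], [0.25, 0.25], [0.1, 0.0]]  # 3 input -> 2 output channels

a_norm = normalize_adjacency(adjacency)
frame_features = [pool_joints(spatial_gcn(f, a_norm, weight)) for f in sequence]
encoded = temporal_attention(frame_features)  # one 2-channel vector per frame
```

A real model would use a full hand-skeleton graph, learned projections, multiple attention heads, and a classifier on top of the encoded frames; the sketch only shows how spatial and temporal processing are separated.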

Paper Structure

This paper contains 23 sections, 10 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of the CoST-GCN framework for 3D skeleton-based action recognition, integrating spatial and temporal feature extraction. The architecture consists of a spatial GCN module for capturing spatial relationships, followed by a Contextual Transformer Graph Encoder for temporal dependencies. The output is processed through a classifier to predict hand gesture actions.
  • Figure 2: S-GCN unit
  • Figure 6: Edge importance
  • Figure 7: Multi-head self-attention mechanism.
  • Figure 8: Self-Attention head.
  • ...and 1 more figure