Robust Dynamic Gesture Recognition at Ultra-Long Distances

Eran Bamani Beeri, Eden Nissinman, Avishai Sintov

TL;DR

This letter presents a novel approach to recognizing dynamic gestures at ultra-range distances of up to 28 meters, enabling natural, directive communication for guiding robots in both indoor and outdoor environments. It also introduces a distance-weighted loss function that is shown to enhance learning and improve model robustness across varying distances.

Abstract

Dynamic hand gestures play a crucial role in conveying nonverbal information for Human-Robot Interaction (HRI), eliminating the need for complex interfaces. Current models for dynamic gesture recognition are limited in their effective recognition range, which restricts their application to close-proximity scenarios. In this letter, we present a novel approach to recognizing dynamic gestures at ultra-range distances of up to 28 meters, enabling natural, directive communication for guiding robots in both indoor and outdoor environments. Our proposed SlowFast-Transformer (SFT) model integrates the SlowFast architecture with Transformer layers to efficiently process and classify gesture sequences captured at ultra-range distances, overcoming the challenges of low resolution and environmental noise. We further introduce a distance-weighted loss function that is shown to enhance learning and improve model robustness at varying distances. Our model demonstrates a significant performance improvement over state-of-the-art gesture recognition frameworks, achieving a recognition accuracy of 95.1% on a diverse dataset with challenging ultra-range gestures. This enables robots to react appropriately to human commands issued from far away, providing an essential enhancement in HRI, especially in scenarios requiring seamless and natural interaction.
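
The abstract names a distance-weighted loss but does not give its form here. As a rough illustration only, the sketch below assumes a cross-entropy loss in which each sample is weighted linearly by its distance from the camera, so that far samples contribute more to the batch loss. The function names, the linear weighting, and the parameters `alpha` and `d_max` are hypothetical choices for this sketch and are not taken from the letter.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def distance_weight(d, d_max=28.0, alpha=1.0):
    """Hypothetical weight: grows linearly from 1 at d = 0 to 1 + alpha at d_max.

    The linear form and both parameters are assumptions for illustration.
    """
    clipped = min(max(d, 0.0), d_max)
    return 1.0 + alpha * clipped / d_max


def distance_weighted_loss(batch_logits, labels, distances, d_max=28.0, alpha=1.0):
    """Weighted mean of per-sample cross-entropy, normalized by the weight sum."""
    weights = [distance_weight(d, d_max, alpha) for d in distances]
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Usage: an easy near sample (2 m) and a hard far sample (28 m).
batch_logits = [[5.0, 0.0], [0.0, 0.0]]
labels = [0, 0]
distances = [2.0, 28.0]
weighted = distance_weighted_loss(batch_logits, labels, distances)
unweighted = sum(cross_entropy(z, y) for z, y in zip(batch_logits, labels)) / 2
print(weighted, unweighted)
```

Under this assumed weighting, the hard far sample dominates the batch loss more than it would in a plain mean, which is one plausible way a loss could emphasize ultra-range examples during training.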

Paper Structure

This paper contains 13 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 2: Demonstration of a user instructing a robot to go back by sweeping an open palm forward and backward, from an ultra-range distance. In addition to the low-resolution view of the user's hand, the robot may confuse the dynamic gesture with the static stop gesture.
  • Figure 3: Overview of the proposed SFT framework for dynamic hand gesture recognition. The framework starts with feature extraction using ResNet, followed by frame reduction using K-Means clustering. User detection is performed using YOLOv3, with the output frames resized to $224 \times 224$. The reduced frames are processed through the SlowFast network to capture both slow and fast motion dynamics. The outputted features from the Slow and Fast pathways are concatenated and passed to a Transformer encoder to capture temporal dependencies. Finally, a classification head is employed to acquire the gesture class.
  • Figure 4: The eight dynamic gestures used in the analysis include: (a) beckoning, (b) go-back, (c) move-right, (d) move-left, (e) turn-around, (f) follow-me, (g) go-down and (h) go-up.
  • Figure 5: Gesture recognition success rate of the SFT model with regard to the distance $d$ of the user from the camera.
  • Figure 6: Confusion matrix for the gesture classification with the SFT model across 13 gesture classes.
  • ...and 2 more figures
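
The Figure 3 caption mentions frame reduction using K-Means clustering over ResNet features but does not say how frames are selected. A minimal sketch of one plausible reading follows: cluster the per-frame feature vectors into `k` groups and keep the frame nearest each centroid, in temporal order. The function `reduce_frames`, the evenly spaced initialization, and the nearest-to-centroid selection rule are assumptions for illustration, and the toy one-dimensional features stand in for real ResNet embeddings.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reduce_frames(features, k, iterations=20):
    """Return sorted indices of up to k representative frames.

    features: list of per-frame feature vectors (lists of floats).
    Selection rule (an assumption): the frame closest to each K-Means centroid.
    """
    n = len(features)
    if k >= n:
        return list(range(n))
    dim = len(features[0])
    # Deterministic initialization from evenly spaced frames.
    centroids = [list(features[(i * n) // k]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every frame to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for idx, feat in enumerate(features):
            nearest = min(range(k), key=lambda c: squared_distance(feat, centroids[c]))
            clusters[nearest].append(idx)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if not members:
                new_centroids.append(centroids[c])  # keep empty clusters in place
                continue
            mean = [sum(features[i][j] for i in members) / len(members) for j in range(dim)]
            new_centroids.append(mean)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Keep the frame nearest to each centroid, restoring temporal order.
    keep = set()
    for c, members in enumerate(clusters):
        if members:
            keep.add(min(members, key=lambda i: squared_distance(features[i], centroids[c])))
    return sorted(keep)


# Usage: six frames forming two visually distinct groups, reduced to two frames.
toy_features = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
selected = reduce_frames(toy_features, k=2)
print(selected)
```

Returning indices rather than frames keeps the sketch independent of any image library; the selected indices would then be used to pick the cropped frames passed on to the SlowFast network.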