Dynamic Gesture Recognition in Ultra-Range Distance for Effective Human-Robot Interaction

Eran Bamani Beeri, Eden Nissinman, Avishai Sintov

TL;DR

The paper tackles ultra-range gesture recognition for Human-Robot Interaction, extending gesture understanding to distances in the range $[4,28]$ meters. It introduces the Temporal-Spatiotemporal Fusion Network (TSFN), which combines Temporal Convolutional Networks (TCN) with $R(2+1)D$ convolutions to capture temporal and spatiotemporal cues in video data. A distance-aware composite loss $\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{global} + \beta \mathcal{L}_{dist} + \gamma \mathcal{L}_{robust}$, with $\mathcal{L}_{dist} = \frac{1}{N} \sum_{i=1}^{N} d_i \cdot \mathcal{L}_{CE}(v_i, y_i)$, penalizes errors more heavily at longer distances and promotes robustness. Empirical results on a six-gesture dataset show TSFN achieving up to 96.1% accuracy and outperforming ViViT, TCN, $R(2+1)D$, and CNN-based baselines, indicating strong potential for real-time HRI in service robots, drones, and search-and-rescue contexts.
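
To make the distance-weighted term concrete, the following is a minimal pure-Python sketch of $\mathcal{L}_{dist}$ and of the weighted sum that forms $\mathcal{L}$. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, $v_i$ is assumed to be a vector of raw class logits, $d_i$ is assumed to be the (possibly normalized) distance of sample $i$, and $\mathcal{L}_{global}$ and $\mathcal{L}_{robust}$ are passed in as precomputed numbers because the summary does not define them.

```python
import math


def cross_entropy(logits, label):
    """Per-sample cross-entropy computed from raw logits.

    Uses the log-sum-exp trick for numerical stability.
    Assumption: `logits` plays the role of v_i and `label` the role of y_i.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def distance_weighted_loss(batch_logits, labels, distances):
    """L_dist = (1/N) * sum_i d_i * L_CE(v_i, y_i).

    Assumption: `distances` holds one non-negative weight d_i per sample.
    """
    n = len(labels)
    total = sum(
        d * cross_entropy(v, y)
        for v, y, d in zip(batch_logits, labels, distances)
    )
    return total / n


def composite_loss(l_ce, l_global, l_dist, l_robust, alpha, beta, gamma):
    """L = L_CE + alpha * L_global + beta * L_dist + gamma * L_robust.

    L_global and L_robust are taken as given values here (hypothetical
    inputs), since their definitions are not part of this summary.
    """
    return l_ce + alpha * l_global + beta * l_dist + gamma * l_robust
```

With identical logits, a sample at 28 m contributes seven times as much to $\mathcal{L}_{dist}$ as a sample at 4 m, which is the sense in which the loss penalizes errors more at longer distances.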

Abstract

This paper presents a novel approach for ultra-range gesture recognition, addressing Human-Robot Interaction (HRI) challenges over extended distances. By leveraging human gestures in video data, we propose the Temporal-Spatiotemporal Fusion Network (TSFN) model that surpasses the limitations of current methods, enabling robots to understand gestures from long distances. With applications in service robots, search and rescue operations, and drone-based interactions, our approach enhances HRI in expansive environments. Experimental validation demonstrates significant advancements in gesture recognition accuracy, particularly in prolonged gesture sequences.

Paper Structure

This paper contains 6 sections, 8 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Six human gestures are depicted with two images each, illustrating the start and end of each gesture. The gestures, arranged from the top left, include beckoning, stop, null, thumbs-up, pointing, and thumbs-down.
  • Figure 2: Model performance vs. distance. The plot shows the accuracy of the TSFN model in recognizing gestures at various distances.