Dynamic Gesture Recognition in Ultra-Range Distance for Effective Human-Robot Interaction

Eran Bamani Beeri; Eden Nissinman; Avishai Sintov

TL;DR

The paper tackles ultra-range gesture recognition for Human-Robot Interaction by extending gesture understanding to distances in the range $[4,28]$ meters. It introduces the Temporal-Spatiotemporal Fusion Network (TSFN), which combines Temporal Convolutional Networks (TCN) and $R(2+1)D$ convolutions to capture temporal and spatiotemporal cues in video data. A distance-aware composite loss $\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{global} + \beta \mathcal{L}_{dist} + \gamma \mathcal{L}_{robust}$, with $\mathcal{L}_{dist} = \frac{1}{N} \sum_{i=1}^{N} d_i \cdot \mathcal{L}_{CE}(v_i, y_i)$, penalizes errors more heavily at longer distances and promotes robustness. Empirical results on a six-gesture dataset show that TSFN achieves up to 96.1% accuracy and outperforms ViViT, TCN, $R(2+1)D$, and CNN-based baselines, indicating strong potential for real-time HRI in service robots, drones, and search-and-rescue contexts.
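To make the distance term concrete, here is a minimal Python sketch of $\mathcal{L}_{dist}$ as written above. It rests on several assumptions not stated in the summary: the per-sample cross-entropy is computed from raw class scores (logits) that a model produces for video $v_i$, $d_i$ is the raw distance in meters with no normalization, and the function names are hypothetical. The $\mathcal{L}_{global}$ and $\mathcal{L}_{robust}$ terms are not defined in this summary, so they are left out rather than guessed.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed from raw class scores
    with the numerically stable log-sum-exp form."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_z - logits[label]

def mean_ce(batch_logits, labels):
    """Plain L_CE: average cross-entropy over the batch."""
    return sum(cross_entropy(z, y) for z, y in zip(batch_logits, labels)) / len(labels)

def distance_weighted_ce(batch_logits, labels, distances):
    """L_dist = (1/N) * sum_i d_i * CE_i.
    Assumption: d_i is the raw distance in meters (no normalization)."""
    total = sum(d * cross_entropy(z, y)
                for z, y, d in zip(batch_logits, labels, distances))
    return total / len(labels)

# Two samples with the same (wrong) prediction, recorded at 4 m and 28 m.
logits = [[0.0, 1.0], [0.0, 1.0]]
labels = [0, 0]
distances = [4.0, 28.0]

l_ce = mean_ce(logits, labels)
l_dist = distance_weighted_ce(logits, labels, distances)
beta = 0.1  # hypothetical weight; the summary does not give its value
partial_loss = l_ce + beta * l_dist  # L_global and L_robust omitted
```

In this toy batch both samples have the same cross-entropy, yet the 28 m sample contributes seven times as much to `l_dist` as the 4 m sample, which is the sense in which the loss penalizes errors more at longer distances.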

Abstract

This paper presents a novel approach for ultra-range gesture recognition, addressing Human-Robot Interaction (HRI) challenges over extended distances. By leveraging human gestures in video data, we propose the Temporal-Spatiotemporal Fusion Network (TSFN) model that surpasses the limitations of current methods, enabling robots to understand gestures from long distances. With applications in service robots, search and rescue operations, and drone-based interactions, our approach enhances HRI in expansive environments. Experimental validation demonstrates significant advancements in gesture recognition accuracy, particularly in prolonged gesture sequences.

Paper Structure (6 sections, 8 equations, 2 figures, 1 table)

Introduction
Methods
Problem Formulation and Data Collection
Models
Experimental Results
Conclusion

Figures (2)

Figure 1: Six human gestures are depicted with two images each, illustrating the start and end of each gesture. The gestures, arranged from the top left, include beckoning, stop, null, thumbs-up, pointing, and thumbs-down.
Figure 2: Model performance vs. distance. The plot shows the accuracy of the TSFN model in recognizing gestures at various distances.
