Dynamic Gesture Recognition in Ultra-Range Distance for Effective Human-Robot Interaction
Eran Bamani Beeri, Eden Nissinman, Avishai Sintov
TL;DR
The paper tackles ultra-range gesture recognition for Human-Robot Interaction by extending gesture understanding to distances in the range of $[4,28]$ meters. It introduces the Temporal-Spatiotemporal Fusion Network (TSFN), which combines Temporal Convolutional Networks (TCN) and R(2+1)D convolutions to capture temporal and spatiotemporal cues in video data. A distance-aware composite loss $\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{global} + \beta \mathcal{L}_{dist} + \gamma \mathcal{L}_{robust}$, with $\mathcal{L}_{dist} = \frac{1}{N} \sum_{i=1}^{N} d_i \cdot \mathcal{L}_{CE}(v_i, y_i)$, penalizes errors more heavily at longer distances and promotes robustness. Empirical results on a six-gesture dataset show TSFN achieving up to 96.1% accuracy and outperforming ViViT, TCN, R(2+1)D, and CNN-based baselines, indicating strong potential for real-time HRI in service robots, drones, and search-and-rescue contexts.
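The distance-weighted term in the loss above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, $v_i$ is assumed to be a vector of class logits, $d_i$ is assumed to be the raw distance in meters (the paper may normalize it), and $\mathcal{L}_{global}$ and $\mathcal{L}_{robust}$ are passed in as precomputed numbers because this summary does not define them.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample given raw class logits."""
    m = max(logits)  # subtract the max before exponentiating for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def distance_weighted_loss(batch_logits, labels, distances):
    """L_dist = (1/N) * sum_i d_i * L_CE(v_i, y_i).

    Assumption: distances are used as-is as per-sample weights.
    """
    n = len(labels)
    total = sum(
        d * cross_entropy(v, y)
        for v, y, d in zip(batch_logits, labels, distances)
    )
    return total / n


def composite_loss(l_ce, l_global, l_dist, l_robust, alpha, beta, gamma):
    """L = L_CE + alpha * L_global + beta * L_dist + gamma * L_robust.

    l_global and l_robust are placeholders supplied by the caller.
    """
    return l_ce + alpha * l_global + beta * l_dist + gamma * l_robust


if __name__ == "__main__":
    # Two samples over six gesture classes with identical logits:
    # one observed at 4 m, one at 28 m.
    logits = [[2.0, 0.5, 0.1, 0.0, -1.0, 0.3]] * 2
    labels = [1, 1]  # both samples are misclassified (argmax is class 0)
    near = distance_weighted_loss(logits[:1], labels[:1], [4.0])
    far = distance_weighted_loss(logits[1:], labels[1:], [28.0])
    print(near, far)
```

With identical logits and labels, the sample at 28 m contributes seven times the loss of the sample at 4 m, which is the sense in which the term penalizes long-range errors more heavily.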
Abstract
This paper presents a novel approach for ultra-range gesture recognition, addressing Human-Robot Interaction (HRI) challenges over extended distances. By leveraging human gestures in video data, we propose the Temporal-Spatiotemporal Fusion Network (TSFN) model that surpasses the limitations of current methods, enabling robots to understand gestures from long distances. With applications in service robots, search and rescue operations, and drone-based interactions, our approach enhances HRI in expansive environments. Experimental validation demonstrates significant advancements in gesture recognition accuracy, particularly in prolonged gesture sequences.
