Ultra-Range Gesture Recognition using a Web-Camera in Human-Robot Interaction

Eran Bamani; Eden Nissinman; Inbar Meir; Lisa Koenigsberg; Avishai Sintov

TL;DR

This study tackles Ultra-Range Gesture Recognition (URGR) for Human-Robot Interaction using only a single RGB camera, targeting distances of up to 25 meters. It introduces HQ-Net, a task-specific single-image super-resolution model, and GViT, a classifier that fuses a Graph Convolutional Network with a Vision Transformer, to classify gestures from degraded ultra-range imagery. The approach is validated on a large, diverse dataset and through real-time robot experiments, achieving a 98.1% recognition rate on test data and a 96% average recognition rate when directing a quadruped robot in indoor and outdoor environments. The work also shows that HQ-Net outperforms existing SR methods in this context, and that cropping the user and enhancing image quality significantly boost recognition success. Overall, the framework provides a cost-effective, RGB-only solution for long-range gesture-based HRI, with practical implications for service robots, search-and-rescue, drones, and outdoor robotics.

Abstract

Hand gestures play a significant role in human interactions where non-verbal intentions, thoughts and commands are conveyed. In Human-Robot Interaction (HRI), hand gestures offer a similar and efficient medium for conveying clear and rapid directives to a robotic agent. However, state-of-the-art vision-based methods for gesture recognition have been shown to be effective only up to a user-camera distance of seven meters. Such a short range limits practical HRI with, for example, service robots, search and rescue robots and drones. In this work, we address the Ultra-Range Gesture Recognition (URGR) problem by aiming for a recognition distance of up to 25 meters in the context of HRI. We propose the URGR framework, a novel deep-learning framework that uses solely a simple RGB camera. Gesture inference is based on a single image. First, a novel super-resolution model termed High-Quality Network (HQ-Net) uses a set of self-attention and convolutional layers to enhance the low-resolution image of the user. Then, we propose a novel URGR classifier termed Graph Vision Transformer (GViT), which takes the enhanced image as input. GViT combines the benefits of a Graph Convolutional Network (GCN) and a modified Vision Transformer (ViT). Evaluation of the proposed framework over diverse test data yields a high recognition rate of 98.1%. The framework has also exhibited superior performance compared to human recognition at ultra-range distances. With the framework, we analyze and demonstrate the performance of an autonomous quadruped robot directed by human gestures in complex ultra-range indoor and outdoor environments, achieving a 96% recognition rate on average.

Paper Structure

This paper contains 25 sections, 7 equations, 22 figures, and 8 tables.

Introduction
Related Work
Gesture Recognition
Super Resolution
Methods
The Ultra-Range Recognition Problem
Data Collection
Image Quality Improvement
Pre-Processing
Super-Resolution Model
URGR Model
Graph Convolutional Networks
Vision Transformer
Graph-Vision Transformer (GViT)
Model Evaluation
...and 10 more sections

Figures (22)

Figure 1: A robot recognizing a directive gesture from a user located 25 meters away by solely using an RGB camera. Upon recognizing, for instance, a beckoning gesture, the robot will move toward the user.
Figure 2: Illustration of the proposed URGR framework. The user in the image is detected with YOLOv3, and the background is then cropped out. Since the user appears in low quality due to the large distance from the camera, the proposed super-resolution model, HQ-Net, enhances the quality of the cropped image. Then, a classification model termed GViT outputs the predicted gesture.
Figure 3: Image examples of (a) pointing and (b) stop gestures in which the user occupies different widths. Hence, pixels around the user are added in order to maintain a constant image proportion.
Figure 4: Illustration of the HQ-Net model focusing on the user. A cropped image is the input to three pathways yielding a quality improved image $\hat{\mathbf{I}}$.
Figure 5: Architecture of the HQ layer used in the HQ-Net (Figure 4) for improving the image quality of a user at ultra-range.
...and 17 more figures
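
The flow described in the captions of Figures 2 and 3 (detect the user, crop with added surrounding pixels to keep a constant image proportion, enhance, classify) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the width-to-height ratio of 0.5, and the callables standing in for YOLOv3, HQ-Net, and GViT are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), right and bottom exclusive


def expand_box_to_aspect(box: Box, img_w: int, img_h: int,
                         aspect: float = 0.5) -> Box:
    """Grow a box around its centre until width / height == aspect.

    The aspect value is a hypothetical choice; the source only states that
    pixels are added to keep a constant image proportion.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    if w / h < aspect:          # too narrow: add pixels left and right
        new_w, new_h = round(h * aspect), h
    else:                       # too wide: add pixels above and below
        new_w, new_h = w, round(w / aspect)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    nx0 = round(cx - new_w / 2)
    ny0 = round(cy - new_h / 2)
    # Shift the box back inside the image if it crosses a border.
    nx0 = max(0, min(nx0, img_w - new_w))
    ny0 = max(0, min(ny0, img_h - new_h))
    return nx0, ny0, min(nx0 + new_w, img_w), min(ny0 + new_h, img_h)


def recognize_gesture(image: Sequence[Sequence],
                      detect_user: Callable[[Sequence], Optional[Box]],
                      enhance: Callable[[list], list],
                      classify: Callable[[list], object],
                      aspect: float = 0.5):
    """Detect -> crop -> enhance -> classify, with every model passed in.

    detect_user, enhance, and classify are placeholders for a person
    detector, a super-resolution model, and a gesture classifier.
    """
    box = detect_user(image)
    if box is None:             # nobody found in the frame
        return None
    img_h, img_w = len(image), len(image[0])
    x0, y0, x1, y1 = expand_box_to_aspect(box, img_w, img_h, aspect)
    crop = [list(row[x0:x1]) for row in image[y0:y1]]
    return classify(enhance(crop))
```

Passing the models in as callables keeps the sketch independent of any particular detector or network; the image is treated as a plain row-major grid so that the cropping step is visible without extra libraries.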
