Real-time Human Finger Pointing Recognition and Estimation for Robot Directives Using a Single Web-Camera

Eran Bamani; Eden Nissinman; Lisa Koenigsberg; Inbar Meir; Yoav Matalon; Avishai Sintov

TL;DR

This work addresses natural human-robot interaction by enabling robots to interpret pointing gestures from a single RGB camera in real-world indoor and outdoor environments. It introduces PointingNet, a three-part framework consisting of PointingNet-S for arm segmentation, PointingNet-E for estimating finger position and pointing direction, and a recognition stage that triggers estimation, with a seven-channel input $C_j=[I_j,B_j,D_j]$ that fuses RGB, binary mask, and MiDaS depth. The approach achieves a mean pointing-angle error below 2°, compared with 28° for state-of-the-art human pose estimation models, along with sub-meter finger-position accuracy, and is validated on two robotic platforms via a ROS-based real-time pipeline. The work provides open-source datasets and a practical, scalable solution for real-time gesture-driven robot directives, with potential extensions to longer-range sensing, multimodal cues, and integrated verbal guidance for enhanced robustness.

Abstract

Gestures play a pivotal role in human communication, often serving as a preferred or complementary medium to verbal expression due to their superior spatial reference capabilities. A finger-pointing gesture conveys vital information regarding some point of interest in the environment. In Human-Robot Interaction (HRI), users can easily direct robots to target locations, facilitating tasks in diverse domains such as search and rescue or factory assistance. State-of-the-art approaches for visual pointing estimation often rely on depth cameras, are limited to indoor environments, and provide discrete predictions between limited targets. In this paper, we explore the development of models that enable robots to understand pointing directives from humans using a single web camera, even in diverse indoor and outdoor environments. A novel perception framework is proposed which includes a designated data-based model termed PointingNet. PointingNet recognizes the occurrence of pointing through classification and then approximates the position and direction of the index finger with an advanced regression model. The model relies on a novel segmentation model for masking any lifted arm. While state-of-the-art human pose estimation models yield a poor pointing-angle estimation error of 28°, PointingNet exhibits a mean error of less than 2°. With the pointing information, the target location is computed, followed by robot motion planning and execution. The framework is evaluated on two robotic systems, demonstrating accurate target reaching.
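The abstract states that the target location is computed from the pointing information but does not say how. One plausible reading, offered here only as an assumption, is that the target is where the pointing ray meets a known plane such as the ground. The sketch below intersects a ray starting at the finger position p with direction x_hat (the symbols used in the Figure 5 caption) with a plane n · q = d expressed in the camera frame. The function name `target_on_plane` and the plane parameters are hypothetical and are not taken from the paper.

```python
def target_on_plane(p, x_hat, n, d, eps=1e-9):
    """Intersect the ray p + t * x_hat (t >= 0) with the plane n . q = d.

    Assumption: the pointed target lies on a known plane given in the
    camera frame. Returns the intersection point as a tuple, or None if
    the ray is parallel to the plane or points away from it.
    """
    denom = sum(ni * xi for ni, xi in zip(n, x_hat))
    if abs(denom) < eps:
        return None  # ray runs parallel to the plane
    t = (d - sum(ni * pi for ni, pi in zip(n, p))) / denom
    if t < 0:
        return None  # plane lies behind the finger
    return tuple(pi + t * xi for pi, xi in zip(p, x_hat))


# Finger 1 m above the plane z = 0, pointing forward and downward.
target = target_on_plane(p=(0.0, 0.0, 1.0), x_hat=(1.0, 0.0, -1.0),
                         n=(0.0, 0.0, 1.0), d=0.0)
```

In practice the plane would have to be known or estimated in the camera frame, and small angular errors grow with distance along the ray, which is one reason the reported angular accuracy matters for target reaching.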

Paper Structure (30 sections, 10 equations, 26 figures, 6 tables)

Introduction
Related Work
Visual Perception with Depth Cameras
Visual Perception with a Single RGB Camera
Perception using Human Pose Estimation Models
Alternative Body-Based Approaches for Pointing Estimation
Pointing Recognition via Wearable Devices
State-of-the-art Summary
Preliminary Pointing Analysis
Methods
Problem Formulation
Overview of Approach
Data Collection
Pointing Segmentation
Pointing Recognition
...and 15 more sections

Figures (26)

Figure 1: A user directs a quadruped robot to a target position by pointing. The robot observes the user through a web camera. PointingNet identifies a pointing gesture and estimates its position and direction. Once the target has been calculated, the robot plans motion and moves to the target.
Figure 2: Illustration of the proposed framework for a robot to reach a pointed target. Pointing recognition acts as a trigger for prompting directive motion. As long as the robot does not recognize a pointing gesture, it will remain in an idling mode. Once pointing of the user has been identified, the pointing position and direction are estimated. The estimations are given with respect to the coordinate frame of the camera on the robot. Hence, they are used by a motion planner for planning the motion of the robot from its current pose to the directed target.
Figure 3: Three approaches for measuring pointing direction: the forearm vector formed by the wrist and elbow points, the index finger vector, and the vector connecting the user's eyes and finger. Pointing error $e_k$ is calculated as the minimal distance between the target center point $\mathbf{g}$ and the calculated direction vector $\mathbf{v}_k$ passing through some key-point $\mathbf{p}_k$, where $k\in\{\texttt{FA},\texttt{IF},\texttt{EF}\}$.
Figure 4: Pointing accuracy with regard to the distance of the user from the pointed target for three measurement approaches.
Figure 5: Illustration of the position $\mathbf{p}$ and direction $\hat{\mathbf{x}}$ definitions of a pointing finger with respect to the coordinate frame of the camera $\mathcal{O}_c$. Direction can also be represented by angles $\beta$ and $\gamma$.
...and 21 more figures
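
The pointing error described in the Figure 3 caption, the minimal distance between the target center g and the direction vector v_k passing through key-point p_k, can be sketched as a point-to-line distance: ‖(g − p_k) × v_k‖ / ‖v_k‖. The snippet below is a minimal illustration and not the authors' code. It assumes the direction is treated as an infinite line in 3D, and the function name `pointing_error` is hypothetical.

```python
import math


def pointing_error(g, p, v):
    """Distance from target center g to the line through p with direction v.

    Assumption: the pointing direction is treated as an infinite line, so
    the result is |(g - p) x v| / |v|. All inputs are 3D coordinates.
    """
    norm_v = math.sqrt(sum(c * c for c in v))
    if norm_v == 0.0:
        raise ValueError("direction vector must be non-zero")
    d = [gi - pi for gi, pi in zip(g, p)]
    # Cross product d x v, computed component by component.
    cx = d[1] * v[2] - d[2] * v[1]
    cy = d[2] * v[0] - d[0] * v[2]
    cz = d[0] * v[1] - d[1] * v[0]
    return math.sqrt(cx * cx + cy * cy + cz * cz) / norm_v


# A target 1 m off a line along the x-axis gives an error of 1 m.
err = pointing_error(g=(0.0, 1.0, 0.0), p=(0.0, 0.0, 0.0), v=(1.0, 0.0, 0.0))
```

The same function applies to each of the three measurement approaches by substituting the corresponding key-point and direction vector.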
