Real-time Human Finger Pointing Recognition and Estimation for Robot Directives Using a Single Web-Camera
Eran Bamani, Eden Nissinman, Lisa Koenigsberg, Inbar Meir, Yoav Matalon, Avishai Sintov
TL;DR
This work addresses natural human-robot interaction by enabling robots to interpret pointing gestures from a single RGB camera in real-world indoor and outdoor environments. It introduces PointingNet, a three-part framework consisting of PointingNet-S for arm segmentation, PointingNet-E for estimating finger position and pointing direction, and a recognition stage that triggers estimation, with a seven-channel input C_j=[I_j,B_j,D_j] that fuses RGB, binary mask, and MiDaS depth. The approach achieves a mean pointing-angle error below 2° and sub-meter finger-position accuracy, outperforming skeleton-based methods, and is validated on two robotic platforms via a ROS-based real-time pipeline. The work provides open-source datasets and a practical, scalable solution for real-time gesture-driven robot directives, with potential extensions to longer-range sensing, multimodal cues, and integrated verbal guidance for enhanced robustness.
Abstract
Gestures play a pivotal role in human communication, often serving as a preferred or complementary medium to verbal expression due to their superior spatial reference capabilities. A finger-pointing gesture conveys vital information regarding some point of interest in the environment. In Human-Robot Interaction (HRI), users can easily direct robots to target locations, facilitating tasks in diverse domains such as search and rescue or factory assistance. State-of-the-art approaches for visual pointing estimation often rely on depth cameras, are limited to indoor environments, and provide discrete predictions among a limited set of targets. In this paper, we explore the development of models that enable robots to understand pointing directives from humans using a single web camera, even in diverse indoor and outdoor environments. A novel perception framework is proposed, which includes a designated data-based model termed PointingNet. PointingNet recognizes the occurrence of pointing through classification and then approximates the position and direction of the index finger with an advanced regression model. The model relies on a novel segmentation model for masking any lifted arm. While state-of-the-art human pose estimation models yield a poor pointing-angle estimation error of 28°, PointingNet exhibits a mean error of less than 2°. With the pointing information, the target location is computed, followed by robot motion planning and execution. The framework is evaluated on two robotic systems, demonstrating accurate target reaching.
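The abstract does not spell out how the target location is computed from the pointing information. One plausible reading, given here only as a minimal sketch and not as the authors' method, is to cast a ray from the estimated fingertip position along the estimated pointing direction and intersect it with the floor plane. The sketch assumes a world frame with the z-axis pointing up, a flat floor at a known height, and a direction given as yaw and pitch angles; the function names `direction_from_angles` and `pointing_target_on_floor` are hypothetical.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def direction_from_angles(yaw: float, pitch: float) -> Vec3:
    """Unit vector for a yaw (about z, measured from the x-axis) and a pitch
    (elevation above the horizontal plane), both in radians.
    The angle convention is an assumption made for this sketch."""
    return (
        math.cos(pitch) * math.cos(yaw),
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
    )


def pointing_target_on_floor(
    finger: Vec3,
    direction: Vec3,
    floor_z: float = 0.0,
    eps: float = 1e-9,
) -> Optional[Vec3]:
    """Intersect the ray finger + t * direction (t > 0) with the plane z = floor_z.

    Returns None when the ray is parallel to the floor or points away from it.
    """
    px, py, pz = finger
    dx, dy, dz = direction
    if abs(dz) < eps:
        return None  # ray is (nearly) parallel to the floor
    t = (floor_z - pz) / dz
    if t <= 0:
        return None  # intersection lies behind the fingertip
    return (px + t * dx, py + t * dy, floor_z)


# Hypothetical example: fingertip 1.5 m above the floor, pointing 45 degrees downward.
finger_position = (0.0, 0.0, 1.5)
pointing_direction = direction_from_angles(yaw=0.0, pitch=math.radians(-45.0))
target = pointing_target_on_floor(finger_position, pointing_direction)
```

Under these assumptions, a small angular error grows into a larger position error as the target moves farther from the user, which is why the gap between a 28° and a 2° pointing-angle error matters for target reaching.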
