Real-time Human Finger Pointing Recognition and Estimation for Robot Directives Using a Single Web-Camera

Eran Bamani, Eden Nissinman, Lisa Koenigsberg, Inbar Meir, Yoav Matalon, Avishai Sintov

TL;DR

This work addresses natural human-robot interaction by enabling robots to interpret pointing gestures from a single RGB camera in real-world indoor and outdoor environments. It introduces PointingNet, a three-part framework consisting of a recognition stage that triggers estimation, PointingNet-S for arm segmentation, and PointingNet-E for estimating finger position and pointing direction, with a seven-channel input C_j=[I_j,B_j,D_j] that fuses RGB, binary mask, and MiDaS depth. The approach reaches a mean pointing-angle error of less than 2° and sub-meter finger-position accuracy, outperforming skeleton-based methods, and is validated on two robotic platforms via a ROS-based real-time pipeline. The work provides open-source datasets and a practical, scalable solution for real-time gesture-driven robot directives, with potential extensions to longer-range sensing, multimodal cues, and integrated verbal guidance for enhanced robustness.

Abstract

Gestures play a pivotal role in human communication, often serving as a preferred or complementary medium to verbal expression due to their superior spatial reference capabilities. A finger-pointing gesture conveys vital information regarding some point of interest in the environment. In Human-Robot Interaction (HRI), users can easily direct robots to target locations, facilitating tasks in diverse domains such as search and rescue or factory assistance. State-of-the-art approaches for visual pointing estimation often rely on depth cameras, are limited to indoor environments, and provide discrete predictions between limited targets. In this paper, we explore the development of models that enable robots to understand pointing directives from humans using a single web camera, even in diverse indoor and outdoor environments. A novel perception framework is proposed that includes a designated data-based model termed PointingNet. PointingNet recognizes the occurrence of pointing through classification and then approximates the position and direction of the index finger with an advanced regression model. The model relies on a novel segmentation model for masking any lifted arm. While state-of-the-art human pose estimation models yield a poor pointing-angle estimation error of 28°, PointingNet exhibits a mean error of less than 2°. With the pointing information, the target location is computed, followed by robot motion planning and execution. The framework is evaluated on two robotic systems, demonstrating accurate target reaching.

Paper Structure (30 sections, 10 equations, 26 figures, 6 tables)

This paper contains 30 sections, 10 equations, 26 figures, and 6 tables.

Figures (26)

  • Figure 1: A user directs a quadruped robot to a target position by pointing. The robot observes the user through a web camera. PointingNet identifies a pointing gesture and estimates its position and direction. Once the target has been calculated, the robot plans motion and moves to the target.
  • Figure 2: Illustration of the proposed framework for a robot to reach a pointed target. Pointing recognition acts as a trigger for prompting directive motion. As long as the robot does not recognize a pointing gesture, it will remain in an idling mode. Once pointing of the user has been identified, the pointing position and direction are estimated. The estimations are given with respect to the coordinate frame of the camera on the robot. Hence, they are used by a motion planner for planning the motion of the robot from its current pose to the directed target.
  • Figure 3: Three approaches for measuring pointing direction: the forearm vector formed by the wrist and elbow points, the index finger vector, and the vector connecting the user's eyes and finger. Pointing error $e_k$ is calculated according to the minimal distance between the target center point $\mathbf{g}$ and the calculated direction vector $\mathbf{v}_k$ passing through some key-point $\mathbf{p}_k$, where $k\in\{\texttt{FA},\texttt{IF},\texttt{EF}\}$.
  • Figure 4: Pointing accuracy with regard to the distance of the user from the pointed target for three measurement approaches.
  • Figure 5: Illustration of the position $\mathbf{p}$ and direction $\hat{\mathbf{x}}$ definitions of a pointing finger with respect to the coordinate frame of the camera $\mathcal{O}_c$. Direction can also be represented by angles $\beta$ and $\gamma$.
  • ...and 21 more figures
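
The pointing error described in the Figure 3 caption, the minimal distance between the target center $\mathbf{g}$ and the line along $\mathbf{v}_k$ through the key-point $\mathbf{p}_k$, can be illustrated with a short sketch. This is not the authors' code: the function name `pointing_error` is hypothetical, and the sketch assumes the direction is treated as an infinite line rather than a forward-only ray, which the caption does not specify.

```python
import math


def pointing_error(g, p, v):
    """Minimal distance from target center g to the line through p along v.

    g, p, v are 3D points/vectors given as sequences of three floats.
    Assumption: the pointing direction is an infinite line, not a ray.
    """
    # Vector from the key-point to the target center.
    d = [gi - pi for gi, pi in zip(g, p)]

    # Normalize the direction vector so the projection is a plain dot product.
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    if norm_v == 0.0:
        raise ValueError("direction vector must be non-zero")
    u = [vi / norm_v for vi in v]

    # Signed length of the projection of d onto the pointing line.
    t = sum(di * ui for di, ui in zip(d, u))

    # Component of d perpendicular to the line; its norm is the error.
    r = [di - t * ui for di, ui in zip(d, u)]
    return math.sqrt(sum(ri * ri for ri in r))
```

For example, with the key-point at the origin, a direction along the x-axis, and a target at (2, 1, 0), the function returns 1.0, since the target lies one unit off the pointing line. The result does not depend on the length of the direction vector, only on its orientation.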