Gaze-Aware Task Progression Detection Framework for Human-Robot Interaction Using RGB Cameras

Linlin Cheng; Koen Hindriks; Artem V. Belopolsky

Abstract

In human-robot interaction (HRI), detecting a human's gaze helps robots interpret user attention and intent. However, most gaze detection approaches rely on specialized eye-tracking hardware, which limits deployment in everyday settings. Appearance-based gaze estimation methods remove this dependency by using standard RGB cameras, but their practicality in HRI remains underexplored. We present a calibration-free framework for detecting task progression when information is conveyed via integrated display interfaces. The framework uses only the robot's built-in monocular RGB camera (640x480 resolution) and state-of-the-art gaze estimation to monitor attention patterns. It leverages a natural behavior: users shift their focus from the task interface to the robot's face to signal task completion. This behavior is formalized through three Areas of Interest (AOIs): tablet, robot face, and elsewhere. Systematic parameter optimization identifies configurations that balance detection accuracy and interaction latency. We validate the framework in a "First Day at Work" scenario, comparing it to button-based interaction. Results show a task completion detection accuracy of 77.6%. Compared to button-based interaction, the proposed system exhibits slightly higher response latency but preserves information retention and significantly improves comfort, social presence, and perceived naturalness. Notably, most participants reported that they did not consciously use eye movements to guide the interaction, underscoring the intuitive role of gaze as a communicative cue. This work demonstrates the feasibility of intuitive, low-cost, RGB-only gaze-based HRI for natural and engaging interactions.
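The completion cue described in the abstract (attention moving from the tablet to the robot's face) can be illustrated with a minimal threshold-based sketch. This is not the paper's state detection algorithm: the function name `detect_completion`, the two frame-count thresholds, and their default values are hypothetical, chosen only to show how a per-frame sequence of AOI labels might be turned into a completion signal.

```python
from enum import Enum


class AOI(Enum):
    """The three Areas of Interest named in the abstract."""
    TABLET = "tablet"
    ROBOT_FACE = "robot_face"
    ELSEWHERE = "elsewhere"


def detect_completion(labels, min_tablet_frames=30, min_face_frames=10):
    """Return the frame index at which completion is signaled, or None.

    Assumption (not from the paper): completion is signaled once the user
    has accumulated at least `min_tablet_frames` frames on the tablet and
    then holds their gaze on the robot's face for `min_face_frames`
    consecutive frames.
    """
    tablet_frames = 0  # total frames spent on the tablet so far
    face_run = 0       # length of the current uninterrupted run on the face
    for i, label in enumerate(labels):
        if label is AOI.TABLET:
            tablet_frames += 1
            face_run = 0
        elif label is AOI.ROBOT_FACE:
            face_run += 1
            if tablet_frames >= min_tablet_frames and face_run >= min_face_frames:
                return i
        else:
            # Gaze elsewhere interrupts a run on the robot's face.
            face_run = 0
    return None
```

In this sketch, a larger `min_face_frames` would reduce false triggers from brief glances at the cost of a slower response, which is the kind of accuracy-versus-latency trade-off the abstract attributes to parameter optimization.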

Paper Structure (22 sections, 5 equations, 7 figures, 1 table)

INTRODUCTION
Related work
Method
Gaze estimation model
Data smoothing
3D-to-2D Gaze Projection
Area of Interest (AOI) Mapping
State Detection Algorithm
Parameter Configuration
Experiment
Experimental Setup
Procedure
Participant
Result
System Accuracy
...and 7 more sections

Figures (7)

Figure 1: Overview of the proposed framework. The system uses the humanoid robot's built-in monocular camera to continuously track human facial features and estimate gaze direction using a pretrained deep learning model. The resulting 3D gaze vector is smoothed to reduce noise, then projected onto the robot's 2D interaction plane and mapped to predefined AOIs. In the state detection algorithm (left flowchart), the user's level of engagement is determined by analyzing their gaze patterns over time within the predefined AOIs using a simple threshold-based approach.
Figure 2: The relationship between 3D gaze directions and their projection onto a 2D screen
Figure 3: Flowchart of the experimental procedure for two interaction conditions, gaze-based and button-based. The gaze-based condition was presented to participants as a “no-button” interaction to conceal the underlying gaze detection mechanism.
Figure 4: Example of a gaze-based interaction task. Top: Smoothed pitch (blue) and yaw (red) over time. The semi-transparent blue areas indicate periods when the robot displayed Page 1 and Page 2. Bottom: Frame sequences from real-time recordings corresponding to three moments ($t_1$, $t_2$, $t_3$) in the top plot. Panel (a) shows the third-person view and panel (b) the robot's view. In (b), the arrows represent the original 3D gaze (red) and the smoothed gaze (green). These three moments illustrate the transition from Page 1 to Page 2 using the proposed framework.
Figure 5: Success rate distribution of the proposed system across participants. The violin plot shows the probability density of success rates, with wider sections indicating where more participants cluster. Individual data points (blue dots) represent each participant's success rate, calculated as the percentage of successful state changes detected across all trials. The red line indicates the mean success rate, and the orange line the median. For the four participants with lower success rates, the main reasons are annotated in the figure with arrows and text.
...and 2 more figures
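
The projection and AOI-mapping steps named in the captions of Figures 1 and 2 can be sketched as follows. This is an illustrative sketch only: the geometry (an eye looking at a plane a fixed distance straight ahead, with pitch and yaw in radians), the function names, and the AOI rectangles in `AOI_BOXES` are assumptions and do not reproduce the paper's equations or measured layout.

```python
import math


def project_gaze(pitch, yaw, distance):
    """Intersect a gaze ray with a plane `distance` metres in front of the eye.

    Assumed convention (hypothetical): pitch > 0 looks up, yaw > 0 looks
    right, and the plane is perpendicular to the zero-angle gaze direction.
    Returns (x, y) on the plane in metres, relative to the point straight ahead.
    """
    x = distance * math.tan(yaw)
    y = distance * math.tan(pitch)
    return x, y


# Hypothetical AOI rectangles on the plane: (x_min, x_max, y_min, y_max) in metres.
AOI_BOXES = {
    "robot_face": (-0.10, 0.10, 0.10, 0.30),
    "tablet": (-0.15, 0.15, -0.30, -0.05),
}


def map_to_aoi(x, y, boxes=AOI_BOXES):
    """Return the name of the first AOI containing (x, y), else 'elsewhere'."""
    for name, (x0, x1, y0, y1) in boxes.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return "elsewhere"
```

For example, under these assumed boxes, `map_to_aoi(*project_gaze(math.atan(0.2), 0.0, 1.0))` lands in `"robot_face"`, while the same pitch with opposite sign lands in `"tablet"`.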
