Gaze-Aware Task Progression Detection Framework for Human-Robot Interaction Using RGB Cameras

Linlin Cheng, Koen Hindriks, Artem V. Belopolsky

Abstract

In human-robot interaction (HRI), detecting a human's gaze helps robots interpret user attention and intent. However, most gaze detection approaches rely on specialized eye-tracking hardware, limiting deployment in everyday settings. Appearance-based gaze estimation methods remove this dependency by using standard RGB cameras, but their practicality in HRI remains underexplored. We present a calibration-free framework for detecting task progression when information is conveyed via integrated display interfaces. The framework uses only the robot's built-in monocular RGB camera (640x480 resolution) and state-of-the-art gaze estimation to monitor attention patterns. It leverages natural behavior, where users shift focus from task interfaces to the robot's face to signal task completion, formalized through three Areas of Interest (AOI): tablet, robot face, and elsewhere. Systematic parameter optimization identifies configurations that balance detection accuracy and interaction latency. We validate our framework in a "First Day at Work" scenario, comparing it to button-based interaction. Results show a task completion detection accuracy of 77.6%. Compared to button-based interaction, the proposed system exhibits slightly higher response latency but preserves information retention and significantly improves comfort, social presence, and perceived naturalness. Notably, most participants reported that they did not consciously use eye movements to guide the interaction, underscoring the intuitive role of gaze as a communicative cue. This work demonstrates the feasibility of intuitive, low-cost, RGB-only gaze-based HRI for natural and engaging interactions.

Paper Structure (22 sections, 5 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Overview of the proposed framework. The system uses the humanoid robot's built-in monocular camera to continuously track the user's facial features and estimate gaze direction with a pretrained deep learning model. The resulting 3D gaze vector is smoothed to reduce noise, then projected onto the robot's 2D interaction plane and mapped to predefined AOIs. In the state detection algorithm (left flowchart), the user's level of engagement is determined by analyzing gaze patterns over time within the predefined AOIs using a simple threshold-based approach.
  • Figure 2: The relationship between 3D gaze directions and their projection onto a 2D screen
  • Figure 3: Flowchart of the experimental procedure for two interaction conditions, gaze-based and button-based. The gaze-based condition was presented to participants as a “no-button” interaction to conceal the underlying gaze detection mechanism.
  • Figure 4: Example of a gaze-based interaction task. Top: Smoothed pitch (blue) and yaw (red) over time. The semi-transparent blue areas indicate the periods when the robot displayed Page 1 and Page 2. Bottom: Frame sequences from real-time recordings corresponding to three moments ($t_1$, $t_2$, $t_3$) in the top plot. Panel (a) shows the third-person view and panel (b) the robot's view. In (b), the arrows represent the original 3D gaze (red) and the smoothed gaze (green). These three moments illustrate the transition from Page 1 to Page 2 using the proposed framework.
  • Figure 5: Success Rate Distribution of the Proposed System Across Participants. The violin plot shows the probability density of success rates, with wider sections indicating where more participants cluster. Individual data points (blue dots) represent each participant's success rate, calculated as the percentage of state changes successfully detected across all trials. The red line indicates the mean success rate, and the orange line the median. Arrows and text in the figure annotate the main reasons for the lower success rates of four participants.
  • ...and 2 more figures
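
The pipeline summarized in the captions of Figures 1 and 2 (smooth the gaze angles, project the 3D gaze ray onto the robot's 2D interaction plane, map the hit point to an AOI, then apply a threshold over time) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the exponential moving average, the pitch/yaw sign convention, the coordinate frame (camera at the origin, plane at z = 0, units in metres), the AOI rectangles, and the frame-count thresholds `min_tablet` and `min_face` are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle on the interaction plane (hypothetical units: metres)."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


# Hypothetical AOI layout on the plane z = 0: camera at the origin, x right, y up.
AOIS = {
    "robot_face": Rect(-0.10, 0.10, -0.10, 0.10),
    "tablet": Rect(-0.15, 0.15, -0.45, -0.25),
}


def smooth(samples, alpha=0.3):
    """Exponential moving average over (pitch, yaw) samples (assumed filter)."""
    out, prev = [], None
    for s in samples:
        if prev is None:
            prev = s
        else:
            prev = tuple(alpha * c + (1 - alpha) * p for c, p in zip(s, prev))
        out.append(prev)
    return out


def gaze_direction(pitch, yaw):
    """Unit gaze vector from angles in radians (assumed convention):
    zero angles look along -z, pitch > 0 looks up, yaw > 0 looks toward +x."""
    return (
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )


def project_to_plane(eye, pitch, yaw):
    """Intersect the gaze ray starting at `eye` with the plane z = 0.
    Returns (x, y) on the plane, or None if the ray never reaches it."""
    ex, ey, ez = eye
    dx, dy, dz = gaze_direction(pitch, yaw)
    if dz >= 0:  # ray is parallel to the plane or points away from it
        return None
    t = -ez / dz
    return (ex + t * dx, ey + t * dy)


def classify_aoi(point):
    """Map a projected point to 'tablet', 'robot_face', or 'elsewhere'."""
    if point is not None:
        for name, rect in AOIS.items():
            if rect.contains(*point):
                return name
    return "elsewhere"


def detect_completion(labels, min_tablet=3, min_face=2):
    """Return the frame index at which completion is flagged, or None.
    Assumed rule: at least `min_tablet` tablet frames have been seen, followed
    by `min_face` consecutive frames on the robot face."""
    tablet_frames = 0
    face_run = 0
    for i, label in enumerate(labels):
        if label == "tablet":
            tablet_frames += 1
            face_run = 0
        elif label == "robot_face":
            face_run += 1
            if tablet_frames >= min_tablet and face_run >= min_face:
                return i
        else:
            face_run = 0
    return None
```

In use, each camera frame would yield an eye position and a smoothed (pitch, yaw) pair; `classify_aoi(project_to_plane(eye, pitch, yaw))` then produces one AOI label per frame, and `detect_completion` scans the label sequence. Raising the thresholds should reduce false triggers at the cost of slower responses, which is the accuracy versus latency trade-off the abstract refers to.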