Predicting the Intention to Interact with a Service Robot: the Role of Gaze Cues

Simone Arreghini, Gabriele Abbate, Alessandro Giusti, Antonio Paolillo

TL;DR

This work tackles early prediction of a nearby person's intention to interact with a service robot by combining body pose, facial landmarks, and gaze cues in a sequence-to-sequence classifier that can be trained in a self-supervised way. Incorporating gaze cues raises AUROC from 84.5% to 91.2% and extends the distance at which an accurate classification can be achieved from 2.4 m to roughly 3.2 m. The authors also validate mutual gaze estimation at long range, quantify how the model adapts to new environments without external supervision, and demonstrate practical applications with a waiter robot.

Abstract

For a service robot, it is crucial to perceive as early as possible that an approaching person intends to interact: in this case, it can proactively enact friendly behaviors that lead to an improved user experience. We solve this perception task with a sequence-to-sequence classifier of a potential user's intention to interact, which can be trained in a self-supervised way. Our main contribution is a study of the benefit of features representing the person's gaze in this context. Extensive experiments on a novel dataset show that the inclusion of gaze cues significantly improves the classifier performance (AUROC increases from 84.5% to 91.2%); the distance at which an accurate classification can be achieved improves from 2.4 m to 3.2 m. We also quantify the system's ability to adapt to new environments without external supervision. Qualitative experiments show practical applications with a waiter robot.
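
The headline metric above is AUROC: the probability that a randomly chosen positive sample (a frame from a person who ends up interacting) receives a higher score than a randomly chosen negative one. The snippet below is a minimal sketch of that definition in plain Python, using the pairwise-comparison form. It is not the authors' evaluation code, and the function name `auroc` and the toy labels and scores are assumptions made for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 ground-truth values (1 = intends to interact).
    scores: iterable of predicted probabilities, same length as labels.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, not data from the paper).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be preferable on a full dataset.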

Paper Structure

This paper contains 18 sections, 6 equations, and 8 figures.

Figures (8)

  • Figure 1: A service robot predicts whether a nearby person intends to interact, so as to proactively enact a friendly behavior.
  • Figure 2: System architecture.
  • Figure 3: AUROC of the RF and LSTM classifiers with the different feature sets.
  • Figure 4: AUROC for the LSTM using $\bm{f}_\text{CH}$ (left) and $\bm{f}_\text{FULL}$ (right) for different human-robot distance quantiles.
  • Figure 5: Median distance to the robot (left) and median predicted probability of interaction (center) as a function of time. Time $t=0$ is defined for each sequence as the moment when the subject either interacts, for positive sequences (dashed line), or the moment in which the subject is closest to the robot, for negative sequences (continuous line). The rightmost plot reports the predicted probability of interaction as a function of distance to the camera, ignoring negative samples with $t>0$. Shaded areas represent the interquartile range.
  • ...and 3 more figures
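
The time alignment described in the Figure 5 caption can be sketched as follows: each sequence is shifted so that $t=0$ is the interaction frame (positive sequences) or the frame of closest approach to the robot (negative sequences), and the median predicted probability is then taken across sequences at each aligned frame offset. This is a minimal illustration, not the authors' code. The dictionary keys `distance`, `probability`, and `interaction_index`, the function names, and the toy values are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def reference_index(distances, interaction_index=None):
    """Frame index used as t=0.

    Positive sequences pass the frame at which the interaction happens;
    negative sequences (interaction_index is None) use the closest frame.
    """
    if interaction_index is not None:
        return interaction_index
    return min(range(len(distances)), key=lambda i: distances[i])


def align(sequence):
    """Return (frame_offset, distance, probability) triples, offset 0 at t=0."""
    i0 = reference_index(sequence["distance"], sequence.get("interaction_index"))
    pairs = zip(sequence["distance"], sequence["probability"])
    return [(i - i0, d, p) for i, (d, p) in enumerate(pairs)]


def median_probability_by_offset(sequences):
    """Median predicted probability across sequences at each aligned offset."""
    buckets = defaultdict(list)
    for seq in sequences:
        for offset, _, prob in align(seq):
            buckets[offset].append(prob)
    return {offset: median(probs) for offset, probs in sorted(buckets.items())}


# Toy sequences (hypothetical values, not data from the paper).
negative = {"distance": [3.0, 1.5, 2.5], "probability": [0.25, 0.5, 0.125]}
positive = {
    "distance": [4.0, 2.0, 1.0],
    "probability": [0.25, 0.75, 1.0],
    "interaction_index": 2,
}
curve = median_probability_by_offset([negative, positive])
```

Bucketing by integer frame offset rather than by time in seconds avoids floating-point key mismatches; multiplying the offsets by the camera frame period afterwards recovers a time axis.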