Gaze Estimation for Human-Robot Interaction: Analysis Using the NICO Platform

Matej Palider, Omar Eldardeer, Viktor Kocur

TL;DR

This paper addresses the practical evaluation of gaze estimation for human-robot interaction (HRI) in a shared workspace. It introduces an annotated dataset collected on the NICO platform and benchmarks four state-of-the-art gaze estimation models, using stereo-camera data to estimate the gaze point on a workspace plane. The key finding is that, while angular accuracy is in line with general-purpose benchmarks, localizing the gaze point on the plane remains imprecise (best median error of about 16.48 cm). This informs how gaze should be integrated as a modality in HRI systems and highlights the value of multimodal cues and temporal aggregation. The work provides actionable recommendations for incorporating gaze information in HRI and releases both the dataset and the evaluation code for reproducibility.
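The TL;DR mentions stereo-camera data. One common way to recover a 3D point such as the head or eye position from a calibrated stereo pair is linear (DLT) triangulation. The sketch below assumes that the 3x4 projection matrices `P_left` and `P_right` are known from stereo calibration and that the same point has been located in both images; the function name `triangulate_dlt` is hypothetical, and this is not necessarily the authors' procedure.

```python
import numpy as np


def triangulate_dlt(P_left: np.ndarray, P_right: np.ndarray,
                    uv_left, uv_right) -> np.ndarray:
    """Linear (DLT) triangulation of one 3D point from a calibrated stereo pair.

    P_left, P_right: 3x4 projection matrices (assumed known from calibration).
    uv_left, uv_right: pixel coordinates (u, v) of the same point, e.g. a face
    landmark, in the left and right images.
    Returns the 3D point in the frame in which the projection matrices are defined.
    """
    u1, v1 = uv_left
    u2, v2 = uv_right
    # Each view contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        u1 * P_left[2] - P_left[0],
        v1 * P_left[2] - P_left[1],
        u2 * P_right[2] - P_right[0],
        v2 * P_right[2] - P_right[1],
    ])
    # The least-squares solution of A X = 0 (with |X| = 1) is the right
    # singular vector associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    # Convert from homogeneous to Euclidean coordinates.
    return X[:3] / X[3]
```

With real data, the pixel coordinates could come from a face detector run on both images (for instance RetinaFace, which the paper cites for head position); DLT is used here only because it is short and standard, and a nonlinear refinement would typically follow it.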

Abstract

This paper evaluates current gaze estimation methods in a human-robot interaction (HRI) scenario with a shared workspace. We introduce a new annotated dataset collected with the NICO robotic platform and evaluate four state-of-the-art gaze estimation models. The evaluation shows that the angular errors are close to those reported on general-purpose benchmarks. However, when expressed as distance on the shared workspace, the best median error is 16.48 cm, which quantifies the practical limitations of current methods. We conclude by discussing these limitations and offering recommendations on how best to integrate gaze estimation as a modality in HRI systems.
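Expressing a gaze prediction as a distance on the workspace requires intersecting the predicted gaze ray with the work-surface plane and measuring how far the hit point lies from the target. The sketch below shows one plausible way to do this, assuming the eye or head position, the plane (a point and a normal) and the target point are all given in the same coordinate frame; the function names are hypothetical and the paper's exact procedure may differ.

```python
import numpy as np


def gaze_point_on_plane(origin, direction, plane_point, plane_normal):
    """Intersect a gaze ray with the workspace plane.

    origin: 3D eye/head position; direction: gaze direction (any length).
    plane_point, plane_normal: a point on the plane and the plane normal.
    Returns the 3D intersection, or None if the ray is parallel to the plane
    or the plane lies behind the viewer.
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    denom = direction @ n
    if abs(denom) < 1e-9:
        return None  # ray (almost) parallel to the plane
    t = ((np.asarray(plane_point, dtype=float) - origin) @ n) / denom
    if t < 0:
        return None  # intersection is behind the viewer
    return origin + t * direction


def planar_error(origin, pred_dir, gt_point, plane_point, plane_normal):
    """Euclidean distance between the predicted hit point and the target point."""
    hit = gaze_point_on_plane(origin, pred_dir, plane_point, plane_normal)
    if hit is None:
        return float("inf")  # no valid intersection: count as unbounded error
    return float(np.linalg.norm(hit - np.asarray(gt_point, dtype=float)))
```

This also shows why similar angular errors can translate into large planar errors: when the ray meets the plane perpendicularly at distance d, an angular error θ moves the hit point by d·tan θ, so at a hypothetical distance of 0.6 m a 10° error already gives about 10.6 cm, and oblique viewing of a table surface stretches the error further.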

Paper Structure

This paper contains 15 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: a) The experimental setup. A human participant sits in front of the robot, and a display is embedded in the workspace shared by the robot and the participant. In our study, we consider the distance between the estimated and the ground-truth gaze point on the shared work surface. b) View from the robot's left camera. c) The grid shown on the display.
  • Figure 2: Example of images used for stereo calibration. Left: Images from the wide field-of-view cameras are significantly distorted. Right: With known camera intrinsics, the images can be rectified.
  • Figure 3: The cumulative distribution of the angular (left) and the distance (right) errors. The plots show the precision achieved (y-axis) given an error threshold (x-axis).
  • Figure 4: The distributions of the yaw and pitch angles for the ground truth and predicted gaze. For ground truth, we consider the direction from the head position estimated using RetinaFace deng2020retinaface to the center of the target square. For visualization, we allow the pitch angles to go below -90°.
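Figures 3 and 4 rest on a few standard quantities: the ground-truth direction from the head position to the center of the target square, its yaw and pitch angles, the angle between predicted and ground-truth directions, and the empirical cumulative distribution of errors. The sketch below computes these under an assumed camera-frame convention (x right, y down, z forward, pitch positive upward) that may differ from the paper's; in particular, it keeps pitch within [-90°, 90°], unlike the visualization in Figure 4. All function names are hypothetical.

```python
import numpy as np


def direction_to_yaw_pitch(d):
    """Convert a 3D direction to (yaw, pitch) in degrees.

    Assumed convention: x right, y down, z forward; yaw is positive to the
    right, pitch is positive upward and lies in [-90, 90].
    """
    x, y, z = np.asarray(d, dtype=float) / np.linalg.norm(d)
    yaw = np.degrees(np.arctan2(x, z))
    pitch = np.degrees(np.arcsin(np.clip(-y, -1.0, 1.0)))
    return yaw, pitch


def angular_error_deg(d_pred, d_gt):
    """Angle in degrees between predicted and ground-truth gaze directions."""
    a = np.asarray(d_pred, dtype=float)
    b = np.asarray(d_gt, dtype=float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def error_cdf(errors, thresholds):
    """Fraction of samples whose error is <= each threshold (a cumulative curve)."""
    e = np.sort(np.asarray(errors, dtype=float))
    return np.searchsorted(e, thresholds, side="right") / e.size
```

In this setup the ground-truth direction would be `target_center - head_position`, and a summary statistic such as the median planar error reported in the abstract corresponds to `np.median(errors)` over all samples.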