Gaze estimation learning architecture as support to affective, social and cognitive studies in natural human-robot interaction

Maria Lombardi, Elisa Maiettini, Agnieszka Wykowska, Lorenzo Natale

TL;DR

A learning robotic architecture that estimates human gaze direction in table-top scenarios without any external hardware is proposed. It is intended to support studies in which external hardware might hinder spontaneous human behaviour.

Abstract

Gaze is a crucial social cue in any interaction scenario and drives many mechanisms of social cognition (joint and shared attention, prediction of human intention, coordination tasks). Gaze direction also signals social and emotional functions and affects the way emotions are perceived. Evidence shows that embodied humanoid robots endowed with social abilities can serve as sophisticated stimuli to unravel many mechanisms of human social cognition while increasing engagement and ecological validity. In this context, building a robotic perception system that automatically estimates human gaze relying only on the robot's sensors is still challenging. The main goal of the paper is to propose a learning robotic architecture that estimates human gaze direction in table-top scenarios without any external hardware. Table-top tasks are widely used in experimental psychology because they can implement numerous scenarios in which agents collaborate while maintaining face-to-face interaction. Such an architecture can provide valuable support in studies where external hardware might be an obstacle to spontaneous human behaviour, especially in environments less controlled than the laboratory (e.g., clinical settings). A novel dataset was also collected with the humanoid robot iCub, including annotated images from 24 participants in different gaze conditions.

Paper Structure

This paper contains 19 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Setup. (a) Overall setup consisting of an ArUco board, the iCub robot with a RealSense camera mounted on its head, and an external camera on a tripod. (b) Close-up of the iCub robot with ArUco markers attached to its body.
  • Figure 2: Dataset collection. (a) Sample frames acquired during the mutual/averted gaze session of the data collection. The participant was asked to look at the robot's face with different head positions (e.g. frontally and with the head rotated) and then not to look at the robot (averted gaze). (b) Sample frames acquired while the participant was looking at one of the robot's body parts (e.g. left forearm). Different views of the human were taken (front and back) in order to show the detection of the ArUco markers both from the iCub's eye and from the external RealSense camera. (c) Sample frames acquired while the participant was looking at the ArUco board placed on the table. The detection of the ArUco board is also shown in this scenario.
  • Figure 3: Dataset annotation. Pictorial representation of the data annotation process, showing the geometric transformations that express the target point in the camera coordinate system (CCS) by passing through the supporting coordinate systems (WCS, BCS and REF). (a) Data annotation for the scenario "Gazing at workspace". (b) Data annotation for the scenario "Gazing at iCub's body parts". The target point $P_t$ is marked in red, the dashed white lines indicate the transformations between the different coordinate systems, whereas the solid white lines indicate the point in the corresponding system. Finally, the resulting point of interest is underlined.
  • Figure 4: Learning architecture. The feature vector extracted by OpenPose is the input to the multiclass classifier, whose output is the pair $\left(r,c\right)$, where $r\in$ (eye_contact, other, iCub, workspace) and $c$ is the confidence level. The feature vector is forwarded to the second layer of the architecture, the gaze regressor, only if the multiclass classifier outputs the class "workspace". The final output is the vector $\left(x,y\right)$ representing the 2D gaze vector, together with the corresponding confidence level.
  • Figure 5: Architecture's output. Sample frames taken from the test set showing the output of the learning pipeline in the scenarios considered: human in mutual gaze, human in averted gaze, human looking at the robot's body parts and human looking at the workspace. For each frame, the multiclass classifier's output is reported with the corresponding confidence level. For the frames classified as workspace, the predicted gaze vector and the reconstructed coordinates of the gaze in the camera reference frame are also reported.
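
The two-layer routing described in the caption of Figure 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `estimate_gaze`, `GazeOutput`, `stub_classifier` and `stub_regressor` are hypothetical, the two models are passed in as plain callables, and the stubs are toy stand-ins for trained models. Reporting the classifier's confidence alongside the regressed vector is also an assumption, since the caption does not say which confidence accompanies the final output.

```python
from typing import Callable, NamedTuple, Optional, Sequence, Tuple

# Class labels as listed in the Figure 4 caption.
CLASSES = ("eye_contact", "other", "iCub", "workspace")


class GazeOutput(NamedTuple):
    label: str                               # predicted class r
    confidence: float                        # confidence level c of the classifier
    gaze_xy: Optional[Tuple[float, float]]   # 2D gaze vector, only for "workspace"


def estimate_gaze(
    features: Sequence[float],
    classify: Callable[[Sequence[float]], Tuple[str, float]],
    regress: Callable[[Sequence[float]], Tuple[float, float]],
) -> GazeOutput:
    """Route a feature vector through the classifier, then (maybe) the regressor."""
    label, confidence = classify(features)
    if label not in CLASSES:
        raise ValueError(f"unexpected class label: {label!r}")
    if label != "workspace":
        # First layer only: the regressor is never called for these classes.
        return GazeOutput(label, confidence, None)
    # Second layer: regress the 2D gaze vector for workspace frames.
    return GazeOutput(label, confidence, regress(features))


# Toy stand-ins (assumptions for illustration, not trained models).
def stub_classifier(features: Sequence[float]) -> Tuple[str, float]:
    return ("workspace", 0.9) if sum(features) > 1.0 else ("eye_contact", 0.8)


def stub_regressor(features: Sequence[float]) -> Tuple[float, float]:
    return (features[0], features[1])


if __name__ == "__main__":
    print(estimate_gaze([0.2, 0.3], stub_classifier, stub_regressor))
    print(estimate_gaze([0.7, 0.6], stub_classifier, stub_regressor))
```

Passing the models in as callables keeps the routing logic independent of how the classifier and regressor are trained, so the stubs can be swapped for real models without touching `estimate_gaze`.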