Anticipation through Head Pose Estimation: a preliminary study

Federico Figari Tomenotti, Nicoletta Noceti

TL;DR

The paper investigates whether 3D head pose can anticipate action goals during reaching and transporting tasks. It introduces a geometry-based pipeline operating on video that estimates head pose from 5 facial landmarks using HHP-Net, integrates object and body detections (via CenterNet and YOLOv8), and reasons about the spatio-temporal relationships among head, hand, and scene objects to identify gazing_target_time, touching_object_time, and target_object_time. Anticipation is quantified as the temporal offset between gazing and action milestones. The findings show an average head-led anticipation of $0.5$ s (approximately $15$ frames at $30$ fps) across actions, modulated by object positioning. This work is a practical step toward online anticipation from non-verbal cues for human-robot interaction, laying groundwork for predictive models in social robotics.
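The offset described above can be illustrated with a minimal sketch. It assumes the two milestones are available as frame indices; the function name `anticipation_seconds` and the example frame numbers are hypothetical and not taken from the paper.

```python
FPS = 30  # frame rate quoted in the summary above


def anticipation_seconds(gazing_target_frame: int, action_frame: int,
                         fps: float = FPS) -> float:
    """Temporal offset between a gazing milestone and an action milestone.

    A positive value means the head reached the target before the hand did.
    """
    return (action_frame - gazing_target_frame) / fps


# Hypothetical frame indices, chosen only to illustrate the arithmetic.
gazing_target_frame = 42      # frame at which gazing_target_time is detected
touching_object_frame = 57    # frame at which touching_object_time is detected

offset = anticipation_seconds(gazing_target_frame, touching_object_frame)
print(f"{offset:.2f} s")  # prints "0.50 s" (15 frames at 30 fps)
```

A negative result would indicate that the hand reached the object before the head was oriented toward it, i.e. no head-led anticipation for that instance.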

Abstract

The ability to anticipate others' goals and intentions is at the basis of human-human social interaction. Such ability, largely based on non-verbal communication, is also a key to having natural and pleasant interactions with artificial agents, like robots. In this work, we discuss a preliminary experiment on the use of head pose as a visual cue to understand and anticipate action goals, particularly reaching and transporting movements. By reasoning on the spatio-temporal connections between the head, hands and objects in the scene, we will show that short-range anticipation is possible, laying the foundations for future applications to human-robot interaction.

Paper Structure (6 sections, 5 figures, 1 table)

This paper contains 6 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: (a, b, c): reaching phase of the transporting action. (a) The head starts to move. (b) The head projection onto the table has reached the target. (c) The hand reaches the target more than 10 frames (1/3 of a second) later. (d, e, f): transporting phase of the transporting action. (d) The head and hand are aligned. (e) The head projection onto the table reaches the target at the very beginning of the hand movement. (f) The hand reaches the target. Bounding boxes and annotations are shown in light blue/violet. The image is in black and white only for clarity.
  • Figure 2: Reaching the bottle in a transporting action. The vertical lines indicate the gazing_target_time (red) and the touching_object_time (blue). The colour code is the same for the time-lines and the vertical lines.
  • Figure 3: Reaching the bottle in a touching action.
  • Figure 4: Reaching the glass in the drinking action. The glass is in front of the subject: the cup is in the subject's line of sight from the very beginning of the action.
  • Figure 5: Moving the bottle to the target position in the transporting action. The head clearly turns toward the target point first, and the hand follows.
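
The captions above refer to the head projection onto the table. One way to picture such a projection is to intersect a ray along the head direction with the table plane. The sketch below is an illustration under stated assumptions, not the authors' method: it assumes a known 3D head position, a horizontal table, and an axis convention (x forward, y left, z up) that does not come from the paper. The function `head_ray_on_table` is hypothetical.

```python
import math


def head_ray_on_table(head_pos, yaw_deg, pitch_deg, table_height=0.0):
    """Intersect a head-direction ray with the plane z = table_height.

    Assumed convention: x forward, y left, z up; yaw rotates about z,
    and negative pitch means looking down.
    Returns the (x, y) hit point, or None if the ray never reaches the plane.
    """
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)
    # Unit direction vector of the head orientation.
    dx = math.cos(pitch) * math.cos(yaw)
    dy = math.cos(pitch) * math.sin(yaw)
    dz = math.sin(pitch)
    if abs(dz) < 1e-9:
        return None  # ray parallel to the table
    t = (table_height - head_pos[2]) / dz
    if t <= 0:
        return None  # table plane lies behind the head direction
    return (head_pos[0] + t * dx, head_pos[1] + t * dy)


# Hypothetical values: head 1 m above the table, looking 45 degrees downward.
print(head_ray_on_table((0.0, 0.0, 1.0), yaw_deg=0.0, pitch_deg=-45.0))
# roughly (1.0, 0.0): the ray lands about 1 m in front of the head
```

Comparing the hit point with an object's position frame by frame would give one possible way to decide when the head "has reached" a target; the threshold and the coordinate frame would depend on the actual setup.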