Anticipation through Head Pose Estimation: a preliminary study

Federico Figari Tomenotti; Nicoletta Noceti

TL;DR

The paper investigates whether 3D head pose can anticipate action goals during reaching and transporting tasks. It introduces a geometry-based pipeline operating on video that estimates head pose from 5 facial landmarks using HHP-Net, integrates object and body detections (via CenterNet and YOLOv8), and reasons about the spatio-temporal relationships among head, hand, and scene objects to identify gazing_target_time, touching_object_time, and target_object_time. Anticipation is quantified as the temporal offset between gazing and action milestones. The findings show that the head leads the hand by an average of $0.5$ s (approximately $15$ frames at $30$ fps) across actions, and that this lead is modulated by object positioning. This work provides a practical step toward online anticipation based on non-verbal cues to improve human-robot interaction, laying groundwork for predictive models in social robotics.
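As a rough illustration of how the temporal offset described above could be computed, the following Python sketch converts frame indices into seconds and averages them. The function name `anticipation_seconds`, the `events` list, and its frame values are hypothetical and chosen only for illustration; they are not taken from the paper's data or code. Only the 30 fps frame rate comes from the text.

```python
from statistics import mean

FPS = 30  # frame rate quoted in the TL;DR


def anticipation_seconds(gazing_target_frame, action_frame, fps=FPS):
    """Offset in seconds between the gazing milestone and the action milestone.

    Positive when the head reaches the target before the hand does.
    """
    return (action_frame - gazing_target_frame) / fps


# Hypothetical (gazing_target_time, touching_object_time) pairs, in frames.
events = [(40, 55), (100, 112), (210, 228)]

offsets = [anticipation_seconds(gaze, touch) for gaze, touch in events]
mean_offset = mean(offsets)

print(f"per-event offsets (s): {offsets}")
print(f"mean anticipation (s): {mean_offset:.2f}")
```

With these made-up values the first event has a 15-frame lead, which corresponds to the 0.5 s figure quoted in the text. The same function would apply to the transporting milestone by passing target_object_time as `action_frame`, assuming both milestones are expressed as frame indices of the same video.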

Abstract

The ability to anticipate others' goals and intentions is at the basis of human-human social interaction. Such ability, largely based on non-verbal communication, is also a key to having natural and pleasant interactions with artificial agents, like robots. In this work, we discuss a preliminary experiment on the use of head pose as a visual cue to understand and anticipate action goals, particularly reaching and transporting movements. By reasoning on the spatio-temporal connections between the head, hands and objects in the scene, we will show that short-range anticipation is possible, laying the foundations for future applications to human-robot interaction.

Paper Structure (6 sections, 5 figures, 1 table)

Introduction
Method
Experimental Section
Dataset
Experiments
Conclusion

Figures (5)

Figure 1: (a,b,c): reaching, from the transporting action. (a) The head starts to move. (b) The head projection onto the table reaches the target. (c) The hand reaches the target more than 10 frames (1/3 of a second) later. (d,e,f): transporting, from the transporting action. (d) The head and hand are aligned. (e) The head projection onto the table reaches the target at the very beginning of the hand movement. (f) The hand reaches the target. Bounding boxes and annotations are shown in light blue/violet. The image is in black and white only for clarity.
Figure 2: Reaching the bottle in a transporting action. The vertical lines indicate the gazing_target_time (red) and the touching_object_time (blue). The colour code is the same for the time-lines and the vertical lines.
Figure 3: Reaching the bottle in a touching action.
Figure 4: Reaching the glass in the drinking action. The glass is in front of the subject: in the first panel, the cup is in the subject's line of sight from the very beginning of the action.
Figure 5: Moving the bottle to the target position in the transporting action. The head clearly turns toward that point first, and the hand follows.
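
The "head projection onto the table" mentioned in the Figure 1 caption can be pictured as the point where a ray along the head direction meets the table surface. The sketch below is only one possible geometric reading, under strong assumptions that are not stated in the source: a 3D world frame in which the table is the plane z = `table_z`, a known head position, and a head direction vector already derived from the estimated pose. The paper's pipeline works on video and may perform this step differently (for example, directly in the image plane). The names `project_onto_table` and `reaches_target` are hypothetical.

```python
import math


def project_onto_table(head_pos, direction, table_z=0.0):
    """Intersect the ray head_pos + t * direction (t >= 0) with the plane z = table_z.

    Returns the (x, y) intersection point, or None when the ray is parallel
    to the table or points away from it. Coordinate frame is an assumption.
    """
    dz = direction[2]
    if abs(dz) < 1e-9:
        return None  # ray runs parallel to the table plane
    t = (table_z - head_pos[2]) / dz
    if t < 0:
        return None  # the table lies behind the head direction
    return (head_pos[0] + t * direction[0], head_pos[1] + t * direction[1])


def reaches_target(point, target_xy, radius):
    """True when the projected point falls within `radius` of the target centre."""
    if point is None:
        return False
    return math.dist(point, target_xy) <= radius


# Hypothetical example: head 1 m above the table, looking forward and down.
point = project_onto_table((0.0, 0.0, 1.0), (1.0, 0.0, -1.0))
print(point)
print(reaches_target(point, (1.1, 0.0), radius=0.2))
```

Under this reading, the first frame in which `reaches_target` returns True for the target object would play the role of gazing_target_time; the radius is a free tolerance parameter introduced here for illustration.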
