Anticipation through Head Pose Estimation: a preliminary study
Federico Figari Tomenotti, Nicoletta Noceti
TL;DR
The paper investigates whether 3D head pose can anticipate action goals during reaching and transporting tasks. It introduces a geometry-based pipeline that operates on video: it estimates head pose from 5 facial landmarks using HHP-Net, integrates object and body detections (via CenterNet and YOLOv8), and reasons about the spatio-temporal relationships among head, hand, and scene objects to identify gazing_target_time, touching_object_time, and target_object_time. Anticipation is quantified as the temporal offset between the gazing milestone and the action milestones. The head leads the action by $0.5$ s on average (approximately $15$ frames at $30$ fps) across actions, and the offset is modulated by object positioning. This work is a practical step toward online anticipation from non-verbal cues to improve human-robot interaction, laying groundwork for predictive models in social robotics.
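The offset described above can be illustrated with a minimal sketch. It assumes the milestones are available as frame indices and that anticipation is the action milestone minus the gazing milestone, so a positive value means the head led the action. The function name `anticipation_seconds`, the sign convention, and the trial values are hypothetical and not taken from the paper.

```python
from statistics import mean


def anticipation_seconds(gazing_frame: int, action_frame: int, fps: float = 30.0) -> float:
    """Temporal offset in seconds between a gazing milestone and an action milestone.

    Assumed convention: positive when the gaze reaches the target before the action.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return (action_frame - gazing_frame) / fps


# Made-up (gazing_target_time, touching_object_time) frame pairs, for illustration only.
trials = [(100, 115), (240, 250), (400, 420)]

offsets = [anticipation_seconds(gaze, touch) for gaze, touch in trials]
mean_offset = mean(offsets)
```

With a frame rate of 30 fps, a 15-frame lead corresponds to the 0.5 s figure quoted in the summary.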
Abstract
The ability to anticipate others' goals and intentions underlies human-human social interaction. This ability, largely based on non-verbal communication, is also key to natural and pleasant interactions with artificial agents such as robots. In this work, we discuss a preliminary experiment on the use of head pose as a visual cue to understand and anticipate action goals, particularly in reaching and transporting movements. By reasoning on the spatio-temporal connections between the head, hands, and objects in the scene, we show that short-range anticipation is possible, laying the foundations for future applications to human-robot interaction.
