A Diver Attention Estimation Framework for Effective Underwater Human-Robot Interaction
Sadman Sakib Enan, Junaed Sattar
TL;DR
This work addresses underwater human-robot interaction (HRI) by estimating diver attentiveness from monocular imagery, so that an autonomous underwater vehicle (AUV) can decide when to initiate interaction. It introduces the Diver Attention (DATT) framework, which comprises the DATT dataset and DATT-Net, a pyramid-structured network that regresses 10 facial keypoints and uses a geometric loss to infer a diver's head pose underwater. A multi-loss objective combining $L_{fc}$, $L_{fb}$, $L_{kp}$, and $L_{gm}$, with hyperparameters $\alpha=0.2$, $\beta=0.15$, and $\gamma=0.1$, keeps keypoint detection robust under occlusion; reported results include AP $=91.75\%$ and mAP $=72.56\%$ for detection and PCK $=86.21\%$ for keypoints. The framework is deployed on the Aqua AUV with a ROS-based controller that uses the attentiveness signal to navigate and reorient the robot before interaction. Closed- and open-water trials show reliable attentiveness estimation ($\approx 89.41\%$ accuracy) and effective autonomous engagement. By enabling monocular estimation of diver head pose and attentiveness that is robust to dive gear and requires no global localization, the work could reduce diver workload and support safer, more scalable collaboration.
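For readers unfamiliar with the PCK (Percentage of Correct Keypoints) metric quoted above, the following is a minimal sketch of how such a score is commonly computed. The function name `pck`, the reference length, the threshold value, and the toy coordinates are all illustrative assumptions; this summary does not state which normalization or threshold the authors used.

```python
import math

def pck(pred, truth, ref_length, threshold=0.5, visible=None):
    """Fraction of keypoints predicted within threshold * ref_length of ground truth.

    pred, truth: sequences of (x, y) pixel coordinates, one pair per keypoint.
    ref_length:  normalizing length in pixels (assumed here, e.g. a face-box side).
    threshold:   fraction of ref_length tolerated as error (assumed value).
    visible:     optional flags; keypoints flagged False are not scored.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of keypoints")
    if visible is None:
        visible = [True] * len(truth)
    max_dist = threshold * ref_length
    scored = 0
    correct = 0
    for (px, py), (tx, ty), vis in zip(pred, truth, visible):
        if not vis:
            continue
        scored += 1
        if math.hypot(px - tx, py - ty) <= max_dist:
            correct += 1
    return correct / scored if scored else float("nan")

# Toy data (hypothetical): 10 keypoints, 8 predicted 3 px off, 2 predicted 30 px off.
truth = [(10.0 * i, 0.0) for i in range(10)]
pred = [(x + 3.0, y) for x, y in truth[:8]] + [(x + 30.0, y) for x, y in truth[8:]]
score = pck(pred, truth, ref_length=20.0, threshold=0.25)  # tolerance = 5 px
print(f"PCK = {score:.2f}")
```

With a 5-pixel tolerance, the eight near misses count as correct and the two large errors do not, so the toy score is 0.80. A dataset-level PCK would aggregate the same count over all annotated keypoints in all test images.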
Abstract
Many underwater tasks, such as cable-and-wreckage inspection and search-and-rescue, can benefit from robust Human-Robot Interaction (HRI) capabilities. With recent advances in vision-based underwater HRI methods, Autonomous Underwater Vehicles (AUVs) can interact with their human partners without requiring assistance from a topside operator. However, these methods assume that the diver is ready for interaction, while in reality the diver may be distracted. In this paper, we address this problem by presenting a diver attention estimation framework that lets an AUV autonomously determine the attentiveness of a diver, and by developing a robot controller that allows the AUV to navigate and reorient itself with respect to the diver before initiating interaction. The core element of the framework is a deep convolutional neural network called DATT-Net. It is based on a pyramid structure that exploits the geometric relations among 10 facial keypoints of a diver to estimate the diver's head orientation, which we use as an indicator of attentiveness. Our on-the-bench experimental evaluations and real-world experiments during both closed- and open-water robot trials confirm the efficacy of the proposed framework.
