A Diver Attention Estimation Framework for Effective Underwater Human-Robot Interaction

Sadman Sakib Enan; Junaed Sattar

TL;DR

This work tackles underwater HRI by estimating diver attentiveness from monocular imagery, so that an AUV can decide when to initiate interaction. It introduces the Diver Attention (DATT) framework, comprising the DATT dataset and DATT-Net, a pyramid-based network with a geometric loss that regresses 10 facial keypoints to infer head pose underwater. A multi-loss objective combining $L_{fc}$, $L_{fb}$, $L_{kp}$, and $L_{gm}$, with hyperparameters $\alpha=0.2$, $\beta=0.15$, and $\gamma=0.1$, keeps keypoint detection robust under occlusion. Reported accuracy is AP $=91.75\%$ and mAP $=72.56\%$ for detection, and PCK $=86.21\%$ for keypoints. The framework is deployed on the Aqua AUV with a ROS-based controller that uses the attentiveness signal to navigate and reorient for interaction. Closed- and open-water trials show reliable attentiveness estimation ($\approx89.41\%$ accuracy) and effective autonomous engagement. The work advances autonomous underwater HRI by enabling monocular, gear-robust understanding of diver head pose without global localization, potentially reducing diver workload and enabling safer, more scalable collaboration.
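
The PCK figure quoted above is a Percentage of Correct Keypoints score: a predicted keypoint counts as correct when it lies within a tolerance of its ground-truth location, and the tolerance is a fraction of some reference length. The sketch below illustrates that general definition only. The function name `pck`, the reference length `ref_size`, and the default threshold of 0.1 are assumptions made for illustration; the normalization and threshold used in the paper are not given in this summary.

```python
import math


def pck(pred, truth, ref_size, thresh=0.1):
    """Return the fraction of predicted keypoints within thresh * ref_size
    of their ground-truth positions.

    pred, truth: equal-length sequences of (x, y) pairs.
    ref_size:    reference length in pixels (assumed here, e.g. a face-box side).
    thresh:      tolerance as a fraction of ref_size (assumed default).
    """
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("pred and truth must be non-empty and equal in length")
    limit = thresh * ref_size
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(pred, truth)
        if math.hypot(px - tx, py - ty) <= limit
    )
    return hits / len(pred)


# Toy data (not from the DATT dataset): four keypoints, one predicted 9 px off.
truth_pts = [(0, 0), (10, 0), (20, 0), (30, 0)]
pred_pts = [(1, 0), (10, 0), (20, 9), (30, 0)]
score = pck(pred_pts, truth_pts, ref_size=50)  # tolerance is 5 px
```

With a 5-pixel tolerance, three of the four toy keypoints fall inside the limit, so `score` is 0.75.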

Abstract

Many underwater tasks, such as cable-and-wreckage inspection and search-and-rescue, can benefit from robust Human-Robot Interaction (HRI) capabilities. With recent advances in vision-based underwater HRI methods, Autonomous Underwater Vehicles (AUVs) can interact with their human partners without assistance from a topside operator. However, these methods assume that the diver is ready for interaction, while in reality the diver may be distracted. In this paper, we address this problem by presenting a diver attention estimation framework that lets an AUV autonomously determine the attentiveness of a diver, and by developing a robot controller that lets the AUV navigate and reorient itself with respect to the diver before initiating interaction. The core element of the framework is a deep convolutional neural network called DATT-Net. It is based on a pyramid structure that exploits the geometric relations among 10 facial keypoints of a diver to estimate their head orientation, which we use as an indicator of attentiveness. Our on-the-bench experimental evaluations and real-world experiments during both closed- and open-water robot trials confirm the efficacy of the proposed framework.

Paper Structure (17 sections, 4 equations, 6 figures, 2 tables)

Introduction
Related Work
DATT Dataset
DATT-Net
Feature Extractor
Facial Anchors
Objective Function Formulation
Diver Attention Estimation
Robot Controller
Experimental Evaluations
Implementation Details
Results
Diver Face Detection and Keypoints Regression
Diver Attention Estimation
Robot Controller's Performance
...and 2 more sections

Figures (6)

Figure 1: Demonstration of the proposed framework running on-board the Aqua AUV [dudek2007aqua] during our open-water experiments in the Caribbean Sea, off the coast of Barbados. The red dashed line represents the robot's future trajectory to position itself conveniently for interaction.
Figure 2: (a) A sample image from the DATT dataset and the corresponding annotations. Each annotation consists of a bounding box label for the face, $\vec{b}=[x_{min},y_{min},x_{max},y_{max}]$, and a label for $10$ facial keypoints, $\vec{p}=[x_1, y_1, \cdots, x_{10}, y_{10}]$. (b) A few additional annotated samples where the divers are wearing different types of scuba gear and are looking in different directions.
Figure 3: The network architecture of DATT-Net employs multi-scale learning on a feature pyramid and includes an additional supervision branch to learn the geometric relations of $10$ facial keypoints. This allows the network to operate effectively even when the diver is facing the robot at a $90$-degree angle.
Figure 4: The training performance of DATT-Net in terms of the validation loss, where the backbone is either (a) ResNet-50 or (b) MobileNet-V2. Here, iterations = $(\text{no. of training images}/\text{batch size}) \times \text{epochs}$.
Figure 5: Qualitative performance of the proposed diver attention estimation framework when run on-board the Aqua AUV in both closed- and open-water environments. Note the robustness of our method as it works at different distances and in challenging lighting conditions.
...and 1 more figure
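
The captions of Figures 2 and 4 define two concrete quantities: the annotation layout (a 4-value face box plus a flat vector of 10 keypoint coordinates) and the iteration count used on the training plots. The sketch below shows one way to represent them. The names `DiverAnnotation`, `parse_annotation`, and `training_iterations` are hypothetical, the box sanity check is an added assumption, and floor division (dropping a partial final batch) is an assumption that the caption does not specify.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class DiverAnnotation:
    """One annotated face: a box and its keypoints (hypothetical container)."""
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    keypoints: List[Tuple[float, float]]     # [(x_1, y_1), ..., (x_10, y_10)]


def parse_annotation(b: Sequence[float], p: Sequence[float],
                     num_keypoints: int = 10) -> DiverAnnotation:
    """Turn the flat vectors b and p from the Figure 2 caption into a record."""
    if len(b) != 4:
        raise ValueError("b must hold [x_min, y_min, x_max, y_max]")
    if len(p) != 2 * num_keypoints:
        raise ValueError(f"p must hold {2 * num_keypoints} values")
    x_min, y_min, x_max, y_max = b
    if x_max <= x_min or y_max <= y_min:
        raise ValueError("box must have positive width and height")
    # p interleaves coordinates: even indices are x, odd indices are y.
    pairs = list(zip(p[0::2], p[1::2]))
    return DiverAnnotation((x_min, y_min, x_max, y_max), pairs)


def training_iterations(num_images: int, batch_size: int, epochs: int) -> int:
    """iterations = (no. of training images / batch size) * epochs.

    Floor division is assumed here, i.e. a partial last batch is dropped.
    """
    return (num_images // batch_size) * epochs


# Toy values for illustration only; these are not DATT dataset numbers.
ann = parse_annotation([10, 20, 110, 140], list(range(20)))
iters = training_iterations(num_images=1000, batch_size=10, epochs=5)
```

Storing keypoints as pairs rather than a flat vector makes per-keypoint checks, such as occlusion flags or distance tests, simpler to write, at the cost of one conversion step when loading labels.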
