GazeHTA: End-to-end Gaze Target Detection with Head-Target Association
Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, Xucong Zhang
TL;DR
GazeHTA performs end-to-end gaze target detection by predicting multiple head–target pairs directly from a scene image. It combines a pre-trained diffusion-model backbone for rich scene features, a head feature re-injection module to strengthen head priors, and explicit connection maps that link each head to its gaze target, all trained with bipartite matching and a multi-term heatmap loss. The approach achieves state-of-the-art performance on GazeFollow and VideoAttentionTarget, with improved head–target association and robustness across backbones, and shows that diffusion-based features can be leveraged effectively for gaze understanding. For human–robot interaction, it identifies what people are looking at in complex scenes and provides explicit head–target associations that can support downstream decision-making.
Abstract
Precisely detecting which object a person is paying attention to is critical for human-robot interaction, since it provides important cues about the human user's next action. We propose an end-to-end approach for gaze target detection: predicting a head-target connection between each individual and the target image region they are looking at. Most existing methods either rely on independent components such as off-the-shelf head detectors or struggle to establish associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on the input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as an explicit visual association between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.
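Because the framework predicts a set of head-target instances rather than one fixed output, training needs a way to decide which prediction is compared with which annotated person. The sketch below illustrates the general idea of bipartite matching on a toy example. It is not the authors' implementation: the dictionary layout, the `pair_cost` function (head-centre distance plus gaze-point distance), and the brute-force search are all assumptions made for illustration. For realistic instance counts, a Hungarian solver such as `scipy.optimize.linear_sum_assignment` is the usual choice.

```python
from itertools import permutations
from math import dist


def pair_cost(pred, gt):
    """Hypothetical matching cost: head-centre distance plus gaze-point distance."""
    return dist(pred["head"], gt["head"]) + dist(pred["target"], gt["target"])


def match_instances(preds, gts):
    """Find the one-to-one assignment of predictions to ground truth with minimal total cost.

    Returns (assignment, total_cost), where assignment[i] is the index of the
    prediction matched to ground-truth instance i. This brute-force search is
    only practical for a handful of instances.
    """
    if len(preds) < len(gts):
        raise ValueError("need at least as many predictions as ground-truth instances")
    best_assignment, best_cost = None, float("inf")
    # Try every ordered choice of len(gts) distinct predictions.
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


# Toy data in normalized image coordinates (x, y); values are made up.
preds = [
    {"head": (0.8, 0.2), "target": (0.5, 0.5)},
    {"head": (0.1, 0.1), "target": (0.3, 0.6)},
    {"head": (0.5, 0.9), "target": (0.9, 0.9)},  # surplus prediction, left unmatched
]
gts = [
    {"head": (0.1, 0.1), "target": (0.3, 0.7)},
    {"head": (0.8, 0.2), "target": (0.5, 0.4)},
]

assignment, total = match_instances(preds, gts)
print(assignment, round(total, 3))
```

In this toy case the first annotated person is matched to the second prediction and the second to the first, while the third prediction stays unmatched. In a set-prediction setup, unmatched predictions are typically supervised as "no instance".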
