GazeHTA: End-to-end Gaze Target Detection with Head-Target Association

Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, Xucong Zhang

TL;DR

GazeHTA performs end-to-end gaze target detection by predicting multiple head–target pairs directly from a scene image. It combines a pre-trained diffusion-model backbone for rich scene features, a head feature re-injection module that strengthens head priors, and explicit connection maps that link each head to its gaze target, all trained with bipartite matching and a multi-term heatmap loss. The approach reports state-of-the-art results on GazeFollow and VideoAttentionTarget, with gains in head–target association and robustness across backbones, indicating that diffusion-based features can be leveraged effectively for gaze understanding. For human–robot interaction, the explicit head–target associations identify what each person in a complex scene is looking at, which supports downstream decision-making.
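The bipartite matching step mentioned above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the squared-difference cost between heatmaps, the helper names `l2_cost` and `match_instances`, and the brute-force search, which stands in for the Hungarian algorithm that set-prediction pipelines typically use. It is not taken from the GazeHTA implementation.

```python
from itertools import permutations


def l2_cost(pred, target):
    """Sum of squared differences between two equally sized 2D heatmaps."""
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )


def match_instances(preds, targets):
    """Return (assignment, total_cost); assignment[j] is the index of the
    prediction matched to ground-truth instance j.

    Brute force over permutations (hypothetical helper): fine for a handful
    of instances, far too slow for large N.
    """
    if len(targets) > len(preds):
        raise ValueError("need at least as many predictions as targets")
    cost = [[l2_cost(p, t) for t in targets] for p in preds]
    best_assignment, best_total = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_total:
            best_assignment, best_total = perm, total
    return best_assignment, best_total


# Two ground-truth heatmaps and three predictions on a toy 2x2 grid.
targets = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
preds = [
    [[0.0, 0.0], [0.1, 0.9]],  # close to targets[1]
    [[0.2, 0.2], [0.2, 0.2]],  # matches nothing well
    [[0.8, 0.1], [0.0, 0.0]],  # close to targets[0]
]
assignment, total = match_instances(preds, targets)
print(assignment)  # (2, 0): targets[0] -> preds[2], targets[1] -> preds[0]
```

Predictions left unmatched (here `preds[1]`) would be supervised as empty proposals in a set-prediction setup; how GazeHTA treats them is not stated in this summary.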

Abstract

Precisely detecting which object a person is paying attention to is critical for human-robot interaction since it provides important cues for the next action from the human user. We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most existing methods rely on independent components such as off-the-shelf head detectors or struggle to establish associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on the input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual association between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.
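To make the idea of a connection map in point (3) concrete, the sketch below rasterizes a soft line segment between a head position and a gaze target. The abstract does not define how the map is constructed, so the Gaussian-falloff line, the `sigma` parameter, and the function name `connection_map` are all assumptions for illustration rather than the authors' definition.

```python
import math


def connection_map(head, target, height, width, sigma=1.0):
    """Rasterize a soft line segment from `head` to `target` ((x, y) pixels).

    Each pixel gets exp(-d^2 / (2 * sigma^2)), where d is its distance to the
    segment. This target definition is an assumption, not the paper's.
    """
    (hx, hy), (tx, ty) = head, target
    dx, dy = tx - hx, ty - hy
    seg_len_sq = dx * dx + dy * dy
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            if seg_len_sq == 0:
                u = 0.0  # degenerate segment: head and target coincide
            else:
                # Project the pixel onto the segment and clamp to its ends.
                u = ((x - hx) * dx + (y - hy) * dy) / seg_len_sq
                u = max(0.0, min(1.0, u))
            px, py = hx + u * dx, hy + u * dy
            d_sq = (x - px) ** 2 + (y - py) ** 2
            row.append(math.exp(-d_sq / (2.0 * sigma ** 2)))
        grid.append(row)
    return grid


# Head at (1, 1), gaze target at (5, 3) on a 7x5 grid; rows are indexed by y.
cmap = connection_map(head=(1, 1), target=(5, 3), height=5, width=7)
```

A map of this kind is high along the path from head to target and near zero elsewhere, which is one way a network could be supervised to tie a specific head to a specific gaze location.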

Paper Structure (23 sections, 4 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: GazeHTA takes the scene image as input to predict head-target instances. GazeHTA consists of a pre-trained diffusion model as the scene feature extractor, a head feature re-injection mechanism, and the connection maps as the visual associations between heads and gaze targets.
  • Figure 2: Network architecture of GazeHTA. The input scene image is transformed into scene features through a pre-trained image encoder and a denoising U-Net. The scene features are then encoded in two ways: via a fully connected layer for out-of-frame predictions, and for head-target proposals. In the head-target prediction branch, a learned head feature $\textbf{F}_\text{head}$ is re-injected and fused with the decoded feature $\textbf{F}_\text{dec}$. The resulting feature $\textbf{F}_\text{prop}$ is then used to predict $N$ head-target proposals, each comprising a head heatmap $\textbf{H}$, a gaze heatmap $\textbf{G}$, and a connection map $\textbf{C}$ that is explicitly learned to associate the head with its gaze target. Conv denotes convolutional layers.
  • Figure 3: Predicted head heatmap, gaze heatmap, and the connection map from GazeHTA. The first two rows demonstrate the strong associations between heads and in-frame gaze targets established by the connection maps. The third row shows the comprehensive understanding of out-of-frame gaze targets in GazeHTA.
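
The outputs described in the captions above (head heatmaps, gaze heatmaps, and out-of-frame predictions for $N$ proposals) could be read out into head-target pairs roughly as follows. This is a sketch under stated assumptions: the argmax readout, one out-of-frame score per proposal, both thresholds, and the names `argmax_2d` and `decode_proposals` are hypothetical and not taken from the paper.

```python
def argmax_2d(heatmap):
    """Return ((x, y), value) of the largest entry in a 2D list."""
    best_xy, best_val = (0, 0), float("-inf")
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_xy, best_val = (x, y), val
    return best_xy, best_val


def decode_proposals(head_maps, gaze_maps, out_of_frame_scores,
                     head_threshold=0.5, oof_threshold=0.5):
    """Turn N proposals into head-target pairs.

    A proposal whose head heatmap peak is below `head_threshold` is skipped;
    a proposal flagged as out of frame gets `gaze=None`. Both thresholds are
    assumed values for illustration.
    """
    instances = []
    for h, g, oof in zip(head_maps, gaze_maps, out_of_frame_scores):
        head_xy, head_conf = argmax_2d(h)
        if head_conf < head_threshold:
            continue  # treat as an empty proposal
        gaze_xy = None if oof >= oof_threshold else argmax_2d(g)[0]
        instances.append({"head": head_xy, "gaze": gaze_xy})
    return instances


# Three toy proposals on a 3x2 grid (rows are indexed by y).
head_maps = [
    [[0.9, 0.1, 0.0], [0.0, 0.0, 0.0]],  # head at (0, 0)
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.8]],  # head at (2, 1)
    [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]],  # no confident head
]
gaze_maps = [
    [[0.0, 0.0, 0.0], [0.0, 0.7, 0.0]],  # gaze target at (1, 1)
    [[0.2, 0.1, 0.0], [0.0, 0.0, 0.0]],  # ignored: flagged out of frame
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
out_of_frame_scores = [0.1, 0.9, 0.2]
instances = decode_proposals(head_maps, gaze_maps, out_of_frame_scores)
```

Because each proposal carries its own head and gaze maps, the pairing between a head and its target comes directly from the proposal index, with no separate head detector or post-hoc association step.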