GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction

Yang Jin, Guangyu Guo, Binglu Wang

TL;DR

GaTector+ is a head-free, unified approach to simultaneous gaze object detection and gaze following. It extends the Specific-General-Specific (SGS) paradigm to SGS+ with a head detector and a head-based attention mechanism, which fuses scene and gaze information without head priors during inference. The model is trained with joint losses covering detection, gaze heatmap estimation, attention supervision, and a box energy aggregation term, and is evaluated with the new mSoC metric, which is more sensitive to differences between bounding boxes. Empirically, GaTector+ achieves state-of-the-art performance on multiple datasets in both gaze object detection and gaze following, highlighting its applicability to real-world gaze interpretation tasks.

Abstract

Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods solve these two tasks separately, and their predictions of gaze objects and gaze following typically depend on head-related prior knowledge during both training and real-world deployment. This dependency necessitates an auxiliary network to extract the head location, which precludes joint optimization of the entire system and constrains practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following that eliminates the dependence on head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor: a shared backbone extracts general features for gaze following and object detection, while specific blocks before and after the shared backbone account for the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism fuses the scene feature and gaze feature with the help of the head location. Because suboptimal optimization of the gaze point heatmap creates a performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap. Finally, we propose a novel evaluation metric, mean Similarity over Candidates (mSoC), for gaze object detection, which is more sensitive to variations between bounding boxes. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following.

Paper Structure

This paper contains 19 sections, 14 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Comparison of different strategies for gaze object prediction. (a) Two separate branches are used to extract head and scene features. (b) GaTector [wang2022gatector] uses a unified framework to solve this problem but still requires a head image as input during inference. (c) GaTector+ needs neither an auxiliary detector nor head prior information during inference.
  • Figure 2: Overview of the proposed GaTector+. (a) The specific-general-specific mechanism (SGS+) consists of a scene-specific branch and a gaze-specific branch that share a backbone to fuse information. (b) The head-free gaze object and gaze location prediction consists of an object detector, a head detector, and a gaze regressor. The gaze regressor uses a head-location-based attention mechanism to fuse the scene and gaze information, and an attention supervision mechanism reduces the learning difficulty of the gaze regressor. An energy aggregation loss jointly optimizes the object detector and gaze regressor.
  • Figure 3: Illustration of the Defocus operation. One channel in the high-resolution feature map is transformed from $r^{2}$ channels in the low-resolution feature map.
  • Figure 4: Illustration of the gaze regressor.
  • Figure 5: (a) Definitions of UoC and mSoC. (b) Comparison of UoC, wUoC, and mSoC in different situations: (1) and (2) are general situations; (3) when the ground truth box lies inside the prediction box, the UoC metric is invalid; (4) when the ground truth box and the prediction box have equal areas and are adjacent to each other, both UoC and wUoC are invalid.
  • ...and 4 more figures
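
The Figure 3 caption describes Defocus as forming one high-resolution channel from $r^{2}$ low-resolution channels. This reads like the familiar depth-to-space (pixel-shuffle) rearrangement, so a minimal sketch under that assumption may help. The function name `defocus`, the nested-list representation, and in particular the channel ordering are illustrative assumptions, not details taken from the paper.

```python
def defocus(low_res, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list.

    Assumed channel layout (the common pixel-shuffle convention, not
    confirmed by the paper):
        out[c][h*r + i][w*r + j] = low_res[c*r*r + i*r + j][h][w]
    """
    channels_in = len(low_res)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    height, width = len(low_res[0]), len(low_res[0][0])
    channels_out = channels_in // (r * r)

    # Allocate the enlarged output: r times taller and wider, r*r fewer channels.
    high_res = [[[0] * (width * r) for _ in range(height * r)]
                for _ in range(channels_out)]

    for c in range(channels_out):
        for i in range(r):          # row offset inside each r x r output cell
            for j in range(r):      # column offset inside each r x r output cell
                src = low_res[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        high_res[c][h * r + i][w * r + j] = src[h][w]
    return high_res


# Four 1x2 low-resolution channels (r = 2) become one 2x4 high-resolution channel.
low = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
high = defocus(low, r=2)
# high == [[[0, 10, 1, 11], [20, 30, 21, 31]]]
```

Every input value appears exactly once in the output, so the rearrangement increases spatial resolution without interpolating or discarding information; only the layout changes.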