GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction
Yang Jin, Guangyu Guo, Binglu Wang
TL;DR
GaTector+ is a head-free, unified framework for simultaneous gaze object detection and gaze following. It extends the Specific-General-Specific (SGS) paradigm to SGS+ by adding a head detector and a head-based attention mechanism, so that scene and gaze information can be fused without head priors at inference time. The model is trained with joint losses covering detection, gaze heatmap estimation, attention supervision, and a box energy aggregation term. It is evaluated with a new metric, mean Similarity over Candidates (mSoC), which is more sensitive to differences between bounding boxes. Empirically, GaTector+ achieves state-of-the-art performance on multiple datasets in both gaze object detection and gaze following while requiring no head priors at inference, which makes it practical for real-world gaze interpretation.
Abstract
Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods solve these two tasks separately, and their predictions of gaze objects and gaze following typically depend on head-related prior knowledge during both training and real-world deployment. This dependency necessitates an auxiliary network to extract the head location, which precludes joint optimization of the entire system and constrains practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following that eliminates the dependence on head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor: a shared backbone extracts general features for both gaze following and object detection, while specific blocks before and after the backbone account for the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch that predicts the head of each person. Then, before regressing the gaze point, a head-based attention mechanism fuses the scene feature and the gaze feature with the help of the head location. Because suboptimal optimization of the gaze point heatmap creates a performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap. Finally, we propose a novel evaluation metric for gaze object detection, mean Similarity over Candidates (mSoC), which is more sensitive to variations between bounding boxes. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following.
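
To make the head-based attention step more concrete, the following is a minimal sketch of one plausible form it could take. It is not the authors' implementation. It assumes that a predicted head box is rasterized into a binary mask, concatenated with a gaze feature vector, mapped by a linear layer to one score per spatial cell, normalized with a softmax, and used to reweight a single-channel scene feature map. The function names, the weight matrix, and all shapes are hypothetical and chosen only for illustration.

```python
import math

def head_mask(height, width, box):
    """Flattened binary mask: 1.0 for grid cells inside the head box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
            for y in range(height) for x in range(width)]

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_based_attention(scene_feat, gaze_vec, box, weights):
    """Reweight a scene feature map with attention derived from gaze feature and head box.

    scene_feat: H x W nested list (single channel, an assumption for brevity)
    gaze_vec:   list of C floats standing in for the gaze feature
    box:        head box in grid-cell coordinates (assumed to come from a head detector)
    weights:    (H*W) x (C + H*W) matrix of a hypothetical linear layer
    """
    height, width = len(scene_feat), len(scene_feat[0])
    # Concatenate the gaze feature with the flattened head-location mask.
    z = list(gaze_vec) + head_mask(height, width, box)
    # Linear layer: one attention score per spatial cell.
    scores = [sum(w * v for w, v in zip(row, z)) for row in weights]
    attn = softmax(scores)
    # Element-wise reweighting of the scene feature by the attention map.
    fused = [[scene_feat[y][x] * attn[y * width + x] for x in range(width)]
             for y in range(height)]
    return attn, fused

# Toy usage with made-up values.
H, W, C = 3, 4, 2
scene = [[1.0] * W for _ in range(H)]
gaze = [0.5, -0.25]
weights = [[0.1 * ((i + j) % 3) for j in range(C + H * W)] for i in range(H * W)]
attn, fused = head_based_attention(scene, gaze, (1, 0, 3, 2), weights)
```

In a trained model the weights would be learned and the features would have many channels. The sketch only illustrates how a head location can condition where scene features are emphasized before the gaze point is regressed.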
