Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model
Yang Jin, Lei Zhang, Shi Yan, Bin Fan, Binglu Wang
TL;DR
This work tackles gaze object prediction by transitioning from coarse box-level supervision to pixel-level gaze object segmentation using Vision Foundation Model supervision. It introduces a unified, end-to-end GOS framework built on MaskDINO with RoI-based head feature reconstruction and a space-to-object gaze regression pipeline that combines dual attention and semantically aware mask interactions to produce accurate gaze heatmaps. The approach leverages SAM-generated masks as pixel-level supervision, enabling precise object localization in dense scenes and removing the need for extra head priors, while achieving strong performance on GOO-Synth and GOO-Real with competitive detection/segmentation metrics and improved gaze estimation, at roughly 14 FPS. Extensive ablations validate the contributions of pixel-level supervision, RoI reconstruction, dual attention fusion, and the interaction with object masks, and qualitative results illustrate reduced semantic ambiguity in gaze localization. A noted limitation is the reliance on SAM prompts for mask generation, suggesting future work on more flexible mask generation without strong location priors.
Abstract
Gaze object prediction (GOP) aims to predict the category and location of the object that a human is looking at. Previous methods used box-level supervision to identify this object, but struggled with semantic ambiguity, i.e., a single box may contain several items when objects are close together. Vision foundation models (VFMs) have improved at segmenting objects from box prompts, which can reduce this confusion by locating objects more precisely, an advantage for fine-grained prediction of gaze objects. This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask of the object captured by human gaze behavior. In particular, we propose that the pixel-level supervision provided by a VFM can be integrated into gaze object prediction to mitigate semantic ambiguity. This leads to our gaze object detection and segmentation framework, which enables accurate pixel-level predictions. Unlike previous methods that require additional head input or ignore head features, we propose to automatically obtain head features from scene features to ensure the model's inference efficiency and flexibility in the real world. Moreover, rather than directly fusing features to predict the gaze heatmap as in existing methods, which may overlook the spatial location and subtle details of the object, we develop a space-to-object gaze regression method to facilitate human-object gaze interaction. Specifically, it first constructs an initial human-object spatial connection, then refines this connection by interacting with semantically clear features in the segmentation branch, and finally predicts a gaze heatmap for precise localization. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate the effectiveness of our method.
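The semantic ambiguity of box-level supervision described above can be illustrated with a toy example. The sketch below is not the paper's method: the scene, the item names (`item_a`, `item_b`), and all helper functions are hypothetical, chosen only to show how a gaze point can fall inside two bounding boxes while lying on a single object mask.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[int, int]            # (x, y) pixel coordinates
Box = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max), inclusive


def rect_pixels(x0: int, y0: int, x1: int, y1: int) -> Set[Point]:
    """Return all pixels of an axis-aligned rectangle (inclusive bounds)."""
    return {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}


def bounding_box(pixels: Set[Point]) -> Box:
    """Tightest axis-aligned box around a pixel mask."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_containing(point: Point, boxes: Dict[str, Box]) -> List[str]:
    """Names of all objects whose bounding box contains the point."""
    x, y = point
    return sorted(
        name
        for name, (x0, y0, x1, y1) in boxes.items()
        if x0 <= x <= x1 and y0 <= y <= y1
    )


def masks_containing(point: Point, masks: Dict[str, Set[Point]]) -> List[str]:
    """Names of all objects whose pixel mask contains the point."""
    return sorted(name for name, pixels in masks.items() if point in pixels)


# Hypothetical scene: an L-shaped item whose bounding box encloses a second item.
masks = {
    "item_a": rect_pixels(0, 0, 9, 3) | rect_pixels(0, 4, 3, 9),  # L shape
    "item_b": rect_pixels(5, 5, 8, 8),                            # small square
}
boxes = {name: bounding_box(pixels) for name, pixels in masks.items()}

gaze_point = (6, 6)  # assumed gaze location, lying on item_b only

box_hits = boxes_containing(gaze_point, boxes)
mask_hits = masks_containing(gaze_point, masks)

print("box-level candidates: ", box_hits)
print("mask-level candidates:", mask_hits)
```

With box-level labels the gaze point is compatible with both items, whereas the pixel-level masks leave a single candidate. This is the kind of ambiguity that mask supervision is intended to reduce.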
