Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model

Yang Jin, Lei Zhang, Shi Yan, Bin Fan, Binglu Wang

TL;DR

This work tackles gaze object prediction by transitioning from coarse box-level supervision to pixel-level gaze object segmentation using Vision Foundation Model supervision. It introduces a unified, end-to-end GOS framework built on MaskDINO with RoI-based head feature reconstruction and a space-to-object gaze regression pipeline that combines dual attention and semantically aware mask interactions to produce accurate gaze heatmaps. The approach leverages SAM-generated masks as pixel-level supervision, enabling precise object localization in dense scenes and removing the need for extra head priors, while achieving strong performance on GOO-Synth and GOO-Real with competitive detection/segmentation metrics and improved gaze estimation, at roughly 14 FPS. Extensive ablations validate the contributions of pixel-level supervision, RoI reconstruction, dual attention fusion, and the interaction with object masks, and qualitative results illustrate reduced semantic ambiguity in gaze localization. A noted limitation is the reliance on SAM prompts for mask generation, suggesting future work on more flexible mask generation without strong location priors.

Abstract

Gaze object prediction (GOP) aims to predict the category and location of the object that a human is looking at. Previous methods used box-level supervision to identify this object but struggled with semantic ambiguity, i.e., a single box may contain several items because objects are close together. Vision foundation models (VFMs) have improved at object segmentation from box prompts, which can reduce this confusion by locating objects more precisely, offering advantages for fine-grained prediction of gaze objects. This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask of the object captured by human gaze behavior. In particular, we propose that the pixel-level supervision provided by a VFM can be integrated into gaze object prediction to mitigate semantic ambiguity. This leads to our gaze object detection and segmentation framework, which enables accurate pixel-level predictions. Unlike previous methods that require additional head input or ignore head features, we propose to automatically obtain head features from scene features to ensure the model's inference efficiency and flexibility in the real world. Moreover, rather than directly fusing features to predict a gaze heatmap as in existing methods, which may overlook the spatial location and subtle details of the object, we develop a space-to-object gaze regression method to facilitate human-object gaze interaction. Specifically, it first constructs an initial human-object spatial connection, then refines this connection by interacting with semantically clear features in the segmentation branch, and ultimately predicts a gaze heatmap for precise localization. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate the effectiveness of our method.
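
The semantic ambiguity of box-level supervision can be illustrated with a toy example. The sketch below is not the paper's method: the pixel grid, the two objects, and the helper names (`bounding_box`, `boxes_containing`, `masks_containing`) are all hypothetical. It only shows that a gaze point can fall inside several bounding boxes at once while lying on a single instance mask.

```python
# Toy illustration with hypothetical data (not taken from the paper).
# Each object is a set of (x, y) pixels, i.e. a pixel-level instance mask.

def bounding_box(pixels):
    """Return the tight axis-aligned box (x0, y0, x1, y1) around a pixel set."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)

def boxes_containing(point, masks):
    """Names of objects whose bounding box contains the point (box-level test)."""
    px, py = point
    hits = []
    for name, pixels in masks.items():
        x0, y0, x1, y1 = bounding_box(pixels)
        if x0 <= px <= x1 and y0 <= py <= y1:
            hits.append(name)
    return sorted(hits)

def masks_containing(point, masks):
    """Names of objects whose mask contains the point (pixel-level test)."""
    return sorted(name for name, pixels in masks.items() if point in pixels)

# Two triangular objects placed close together: their boxes overlap,
# but their pixels do not.
masks = {
    "cup": {(x, y) for x in range(4) for y in range(4) if x + y <= 4},
    "book": {(x, y) for x in range(2, 6) for y in range(2, 6)
             if (x - 2) + (y - 2) >= 3},
}
gaze_point = (2, 2)

box_hits = boxes_containing(gaze_point, masks)    # ambiguous: two candidates
mask_hits = masks_containing(gaze_point, masks)   # unambiguous: one candidate
print(box_hits, mask_hits)
```

In this toy scene the box-level test returns both objects, whereas the pixel-level test returns only the object that actually covers the gaze point.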

Paper Structure (19 sections, 12 equations, 12 figures, 9 tables)

This paper contains 19 sections, 12 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: (a) Box-level supervision often fails to localize objects in dense settings precisely and leads to semantic ambiguity problems, whereas pixel-level supervision excels by providing clear semantic distinction through pixel-by-pixel predictions. (b) Vision foundation models can produce instance masks, thereby segmentation features can be used to improve the gaze regression branch's spatial perception, and the gaze object mask can help the heatmap focus on the gaze object.
  • Figure 2: Overview of the proposed model. (a) The feature extraction module extracts features for detection and regression branches. (b) The detection and segmentation branch identifies object and human head positions. (c) After obtaining head features, the gaze regression branch progressively refines its output: 1) employing a dual attention fusion module for initial human-object correlations; 2) leveraging a feature interaction module to incorporate semantically clear object-aware insights from the segmentation branch; 3) ultimately predicting the gaze heatmap. (d) Supervision signals are applied to both branches only during training.
  • Figure 3: Illustration of instance mask generated by VFM.
  • Figure 4: Illustration of spatial perception for gaze objects.
  • Figure 5: Comparison of real-world inference between TransGOP [wang24transgop] and our method.
  • ...and 7 more figures
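
The caption of Figure 1(b) notes that the gaze object mask can help the heatmap focus on the gaze object. One way to picture this interaction is to score each instance mask by the share of heatmap energy it covers. The sketch below is an assumption made for illustration, not the paper's formulation: the Gaussian heatmap, the energy-ratio rule, and the names `gaussian_heatmap`, `mask_energy`, and `select_gaze_object` are all hypothetical.

```python
import math

def gaussian_heatmap(width, height, center, sigma=1.0):
    """Build a height x width heatmap with a Gaussian peak at center=(cx, cy)."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def mask_energy(heatmap, pixels):
    """Sum the heatmap values over the (x, y) pixels of one instance mask."""
    return sum(heatmap[y][x] for x, y in pixels)

def select_gaze_object(heatmap, masks):
    """Pick the mask covering the largest share of heatmap energy (assumed rule)."""
    total = sum(sum(row) for row in heatmap)
    scores = {name: mask_energy(heatmap, pixels) / total
              for name, pixels in masks.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical 8x8 scene split into two adjacent objects.
masks = {
    "left": {(x, y) for x in range(0, 4) for y in range(8)},
    "right": {(x, y) for x in range(4, 8) for y in range(8)},
}
heatmap = gaussian_heatmap(8, 8, center=(5, 4), sigma=1.0)
best, scores = select_gaze_object(heatmap, masks)
print(best, scores)
```

Because the peak sits just right of the boundary, part of the heatmap spills onto the neighbouring object; scoring by mask still assigns the gaze to a single instance.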