TransGOP: Transformer-Based Gaze Object Prediction

Binglu Wang, Chenxi Guo, Yang Jin, Haisheng Xia, Nian Liu

TL;DR

This work tackles gaze object prediction (GOP) by introducing TransGOP, an end-to-end Transformer-based framework that combines a DETR-like object detector with a Transformer-based gaze regressor. A novel object-to-gaze cross-attention mechanism and a gaze box loss enable the model to learn long-range human–object relationships and to jointly optimize object detection with gaze heatmap prediction. The approach achieves state-of-the-art results on GOO-Synth and GOO-Real across object detection, gaze estimation, and GOP, demonstrating the benefits of Transformer-based detectors in dense retail scenes and enabling end-to-end training without post-processing. The findings highlight the practical value of long-range attention and joint optimization for accurately localizing and identifying gaze objects in real-world applications.

Abstract

Gaze object prediction (GOP) aims to predict the location and category of the object that a human is looking at. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors predict object locations more accurately for dense objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer can help to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces the Transformer into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the location of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism that lets the queries of the gaze autoencoder learn global-memory position knowledge from the object detector. Finally, to train the whole framework end to end, we propose a Gaze Box loss that jointly optimizes the object detector and gaze regressor by enhancing the gaze heatmap energy in the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git.
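
To make the idea of "enhancing the gaze heatmap energy in the box of the gaze object" concrete, here is a minimal NumPy sketch. It is not the paper's formulation, which this page does not give. It assumes, purely for illustration, that the loss is one minus the fraction of heatmap energy that falls inside the gaze object's box. The function names `box_energy_ratio` and `gaze_box_loss_sketch`, the pixel box format `(x1, y1, x2, y2)`, and the toy heatmap are all hypothetical.

```python
import numpy as np


def box_energy_ratio(heatmap, box):
    """Fraction of total heatmap energy inside box = (x1, y1, x2, y2), in pixels.

    Assumed convention: x indexes columns, y indexes rows, x2/y2 exclusive.
    """
    x1, y1, x2, y2 = box
    total = heatmap.sum()
    if total <= 0:
        return 0.0  # no energy anywhere, so none inside the box
    inside = heatmap[y1:y2, x1:x2].sum()
    return float(inside / total)


def gaze_box_loss_sketch(heatmap, box):
    """Illustrative loss (an assumption): small when energy concentrates in the box."""
    return 1.0 - box_energy_ratio(heatmap, box)


# Toy heatmap: 16 units of energy inside the box, 4 units elsewhere.
heatmap = np.zeros((64, 64), dtype=np.float32)
heatmap[20:24, 30:34] = 1.0  # energy on the gaze object
heatmap[50:52, 5:7] = 1.0    # stray energy far from the gaze object
gaze_box = (28, 18, 36, 26)

loss = gaze_box_loss_sketch(heatmap, gaze_box)
```

In this toy case 16 of the 20 units of energy lie inside the box, so the sketch loss is about 0.2. Moving the stray energy into the box would drive it toward 0, which is the behavior the abstract describes qualitatively.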

Paper Structure (16 sections, 5 equations, 6 figures, 8 tables)

This paper contains 16 sections, 5 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Object detection results of GaTector (Wang et al., 2022) (a) and our TransGOP (b) at an IoU threshold of 0.75. TransGOP predicts object locations more accurately than GaTector, especially for objects that are close to the human or the goods shelves.
  • Figure 2: Overview of the TransGOP framework. (a) The object detector in TransGOP is an off-the-shelf Transformer-based object detection method that detects object location and category. (b) The gaze regressor feeds the fused feature into the Transformer-based gaze autoencoder to predict the gaze heatmap. (c) The optimization of TransGOP consists of three parts: the object detection loss $\mathcal{L}_{\rm det}$ for optimizing the object detector, the gaze regression loss $\mathcal{L}_{\rm gaze}$ for optimizing the gaze regressor, and the gaze box loss $\mathcal{L}_{\rm gb}$ for jointly optimizing the object detector and gaze regressor.
  • Figure 3: Details of the Transformer-based gaze autoencoder and the object-to-gaze cross-attention in the gaze regressor.
  • Figure 4: Illustration of the gaze box loss.
  • Figure 5: Object detection visualization of GaTector and TransGOP at an IoU threshold of 0.75.
  • ...and 1 more figure
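
The object-to-gaze cross-attention named in the Figure 3 caption can be illustrated with a generic single-head scaled dot-product attention, in which queries from the gaze autoencoder attend over memory tokens from the object detector. This is only a sketch under stated assumptions: the token counts, the feature dimension, and the omission of learned query/key/value projections, multiple heads, and positional encodings are simplifications, and none of these details is specified on this page. The names `cross_attention`, `gaze_queries`, and `detector_memory` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, memory):
    """Single-head scaled dot-product cross-attention without learned projections.

    queries: (num_queries, d) array, e.g. gaze autoencoder queries (assumed shape).
    memory:  (num_tokens, d) array, e.g. object detector memory (assumed shape).
    Returns the attended features (num_queries, d) and the attention weights
    (num_queries, num_tokens).
    """
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)  # similarity of each query to each token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ memory, weights          # memory serves as both keys and values


rng = np.random.default_rng(0)
gaze_queries = rng.standard_normal((16, 32))      # hypothetical: 16 queries, d = 32
detector_memory = rng.standard_normal((100, 32))  # hypothetical: 100 memory tokens

attended, weights = cross_attention(gaze_queries, detector_memory)
```

Each gaze query ends up as a weighted mixture of detector memory tokens, which is one way the gaze branch could pick up position knowledge from the detection branch regardless of how far apart the head and the object are in the image.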