OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction

Yini Fang; Jingling Yu; Haozheng Zhang; Ralf van der Lans; Bertram Shi

TL;DR

This work tackles gaze scanpath prediction in cluttered visual scenes by shifting from pixel-level saliency to object-level attention. It introduces the Object-level Attention Transformer (OAT), an encoder-decoder transformer that embeds each object, includes the target as a token in the encoder, and uses a memory-enabled object attention module based on cross-attention to predict the next fixated object. Key innovations include a distance-based 2D positional encoding and an object-centric attention mechanism that accommodates varying scene sizes and target objects, enabling accurate sequence modeling of gaze. Evaluated on the Amazon book cover dataset and a newly collected yogurt and wine dataset, OAT achieves state-of-the-art alignment with human scanpaths and generalizes well, as supported by a novel behavioural-based metric and comprehensive ablations.

Abstract

Visual search is important in our daily life. The efficient allocation of visual attention is critical to effectively complete visual search tasks. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than pixel intensities. This paper introduces the Object-level Attention Transformer (OAT), which predicts human scanpaths as they search for a target object within a cluttered scene of distractors. OAT uses an encoder-decoder architecture. The encoder captures information about the position and appearance of the objects within an image and about the target. The decoder predicts the gaze scanpath as a sequence of object fixations by integrating output features from both the encoder and decoder. We also propose a new positional encoding that better reflects spatial relationships between objects. We evaluated OAT on the Amazon book cover dataset and a new dataset for visual search that we collected. OAT's predicted gaze scanpaths align more closely with human gaze patterns than predictions by algorithms based on spatial attention, on both established metrics and a novel behavioural-based metric. Our results demonstrate the generalization ability of OAT, as it accurately predicts human scanpaths for unseen layouts and target objects.
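
The decoding procedure described above (predict a distribution over objects, pick the next fixation, repeat until an end token) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `predict_scanpath` and `toy_scorer` are hypothetical, greedy selection is an assumption (the source does not say whether the next object is chosen greedily or sampled), and the hand-written scorer merely stands in for a trained encoder-decoder.

```python
import math

EOS = "<EOS>"  # end-of-scanpath token, as named in the Figure 2 caption


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_scanpath(score_next, object_ids, max_len=20):
    """Greedy autoregressive decoding over objects (assumed strategy).

    score_next(history, candidates) is a hypothetical stand-in for the
    model: it returns one raw score per candidate given the fixations so far.
    """
    candidates = list(object_ids) + [EOS]
    scanpath = []
    for _ in range(max_len):
        probs = softmax(score_next(scanpath, candidates))
        best = candidates[probs.index(max(probs))]
        if best == EOS:  # stop once the end token is the most likely
            break
        scanpath.append(best)
    return scanpath


def toy_scorer(target):
    """Toy scorer (illustrative only): visit objects in id order, avoid
    refixations, and emit <EOS> right after the target is fixated."""
    def score_next(history, candidates):
        scores = []
        for c in candidates:
            if c == EOS:
                scores.append(5.0 if history and history[-1] == target else -5.0)
            elif c in history:
                scores.append(-10.0)
            else:
                scores.append(-float(c))
        return scores
    return score_next


print(predict_scanpath(toy_scorer(target=2), [0, 1, 2, 3]))
```

With the toy scorer, the loop fixates objects 0, 1 and 2 and then stops, because the end token becomes the most probable choice once the target has been reached. Real human scanpaths can revisit objects, so the refixation penalty here is purely a simplification of the toy example.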

Paper Structure (20 sections, 1 equation, 7 figures, 7 tables)

Introduction
Related Work
OAT Architecture
Object Embedding and Encoder.
Distance-based Positional Encoding.
Decoder.
Object Attention (OA)
Training and Testing.
Experimental Results
Datasets
Implementation Settings
Metrics
Quantitative Performance Comparison
Scanpath Visualization in Spatial Dimension
Generalization to Unknown Categories
...and 5 more sections

Figures (7)

Figure 1: Illustration of object-level scanpath prediction.
Figure 2: An overview of OAT. The inputs to the encoder are a target object and an image containing the target object and other distractor objects; the input to the decoder is the sequence of previously and currently fixated objects. The output is a probability distribution over which object is fixated next. The model repeats this process until the next token is the end token <EOS>, which completes the prediction of one gaze scanpath.
Figure 3: (left) Cosine similarity between the positional encoding (PE) at the centre position and the PEs at other positions, for the PE of Vaswani et al. (2017). (right) A Gaussian distribution.
Figure 4: Example product array images.
Figure 5: Heatmap of scanpaths predicted by OAT and generated by humans.
...and 2 more figures
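
The quantity plotted in the left panel of Figure 3 can be approximated with a short sketch. It uses the standard sinusoidal positional encoding from Vaswani et al. (2017) and computes the cosine similarity between the encoding at a centre position and the encodings at every other position. The sketch is 1D for simplicity, whereas the figure concerns image positions, and the function names, `d_model=64` and the number of positions are illustrative assumptions rather than settings from the paper. It does not implement the paper's distance-based encoding, whose definition is not given here.

```python
import math


def sinusoidal_pe(pos, d_model=64):
    """Standard sinusoidal positional encoding for one position.

    Even dimensions hold sin terms and odd dimensions hold cos terms;
    d_model is assumed to be even.
    """
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def similarity_profile(n_positions=21, d_model=64):
    """Similarity of every position's PE to the PE at the centre position."""
    centre = n_positions // 2
    ref = sinusoidal_pe(centre, d_model)
    return [
        cosine_similarity(ref, sinusoidal_pe(p, d_model))
        for p in range(n_positions)
    ]


profile = similarity_profile()
for pos, sim in enumerate(profile):
    print(f"position {pos:2d}: similarity {sim:.3f}")
```

For this encoding the similarity depends only on the offset between two positions, so the printed profile is symmetric about the centre and equals 1 at the centre itself. Plotting the profile alongside a Gaussian curve gives a rough 1D analogue of the comparison the figure makes.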
