Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers

Zhibo Yang, Sounak Mondal, Seoyoung Ahn, Ruoyu Xue, Gregory Zelinsky, Minh Hoai, Dimitris Samaras

TL;DR

The paper presents HAT, a unified transformer-based framework that predicts both top-down and bottom-up visual attention by integrating a foveated retina-inspired memory and dense per-pixel predictions. By formulating scanpath prediction as sequential dense heatmaps with termination signals, and by maintaining a dynamic working memory updated at each fixation, HAT achieves state-of-the-art performance across target-present, target-absent, and free-viewing tasks, while offering interpretable insights through peripheral contributions. The approach demonstrates strong generalization to unseen scenes and shows competitive results on OSIE and MIT1003, highlighting its robustness and applicability to diverse attention-demanding scenarios. The authors provide extensive ablations, qualitative analyses, and implementation details to validate the architecture's components and its advantages over previous fixation-discretization methods. Overall, HAT advances computational attention by unifying distinct attention controls under a single, interpretable, high-resolution, dense-prediction framework with practical implications for AR/VR and attention-aware processing.

Abstract

Most models of visual attention aim at predicting either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. In this paper we propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT uses a novel transformer-based architecture and a simplified foveated retina that collectively create a spatio-temporal awareness akin to the dynamic visual working memory of humans. HAT not only establishes a new state-of-the-art in predicting the scanpath of fixations made during target-present and target-absent visual search and "taskless" free viewing, but also makes human gaze behavior interpretable. Unlike previous methods that rely on a coarse grid of fixation cells and experience information loss due to fixation discretization, HAT features a sequential dense prediction architecture and outputs a dense heatmap for each fixation, thus avoiding discretizing fixations. HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios. Code is available at https://github.com/cvlab-stonybrook/HAT.
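
The information loss that the abstract attributes to fixation discretization can be illustrated with a small sketch. It snaps a pixel-level fixation to the center of its cell on a coarse grid and measures how far the fixation moved. The image size, grid size, and function names below are illustrative assumptions, not values or code from the paper.

```python
import math

def discretize_to_grid(x, y, img_w, img_h, grid_w, grid_h):
    """Snap a pixel fixation (x, y) to the center of its grid cell.

    Hypothetical helper: mimics a coarse grid of fixation cells.
    """
    cell_w = img_w / grid_w
    cell_h = img_h / grid_h
    # Clamp so that fixations on the right/bottom border stay in the last cell.
    col = min(int(x // cell_w), grid_w - 1)
    row = min(int(y // cell_h), grid_h - 1)
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)

def quantization_error(x, y, img_w, img_h, grid_w, grid_h):
    """Euclidean distance (in pixels) between a fixation and its cell center."""
    cx, cy = discretize_to_grid(x, y, img_w, img_h, grid_w, grid_h)
    return math.hypot(x - cx, y - cy)

# Assumed setup: a 512x320 image with a 32x20 grid, i.e. 16x16-pixel cells.
coarse_center = discretize_to_grid(100, 50, 512, 320, 32, 20)
coarse_error = quantization_error(100, 50, 512, 320, 32, 20)

# A per-pixel output corresponds to a grid as fine as the image itself.
dense_error = quantization_error(100, 50, 512, 320, 512, 320)
```

With these assumed sizes, a fixation can be displaced by up to half a cell diagonal on the coarse grid, whereas a per-pixel output keeps the displacement below one pixel.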

Paper Structure

This paper contains 27 sections, 5 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Given an image, the proposed HAT is able to predict scanpaths under three settings: target-present search for a TV, target-absent search for a sink, and free viewing. Importantly, HAT outperforms previous state-of-the-art scanpath prediction methods on multiple datasets across these three settings (target-present visual search, target-absent visual search, and free viewing), which were previously studied separately.
  • Figure 2: HAT overview. We use encoder-decoder CNNs to extract two sets of feature maps $P_1$ and $P_4$ of different spatial resolutions. A working memory with a capacity of $\lambda$ tokens is constructed by combining all feature vectors from $P_1$ with the feature vectors of $P_4$ at previously fixated locations, representing information extracted from the periphery and central fovea. A transformer encoder is used to dynamically update the working memory at every new fixation. Then, HAT produces $N$ per-task queries of dimension $C$ (e.g., clock search and mouse search), with each learning to aggregate task-specific information from the shared working memory for predicting the fixations for its own task. Finally, the updated queries are convolved with $P_4$ to yield the fixation heatmaps after an MLP layer, and projected to the termination probabilities in parallel. Note that although this figure depicts visual search, the framework also applies to free viewing.
  • Figure 3: Working memory construction. We construct the working memory by starting with the visual embeddings ("what") flattened from $P_1$ over the spatial axes and selected from $P_4$ at previous fixation locations. A scale embedding is introduced to capture scale information. Spatial embeddings and temporal embeddings are further added to the tokens to enhance the "where" and "when" signals. At every new fixation (marked in red), we simply add a new foveal token while keeping other tokens unchanged.
  • Figure 4: Visualization of the ground-truth human scanpaths and predicted scanpaths of different methods (columns). Three different settings (rows), namely target-present bottle search, target-absent stop sign search, and free viewing, are shown from top to bottom. The final fixation of each scanpath is highlighted with a red circle. For methods without termination prediction, i.e., IRL, detector, and fixation heuristic, we visualize the first 6 fixations for visual search and 15 for free viewing. The rightmost column shows the predicted scanpaths of the heuristic methods (detector for visual search and fixation heuristic for free viewing).
  • Figure 5: Visualization of the predicted scanpath, peripheral contribution map and fixation heatmap (columns) of HAT for target-present laptop visual search examples at every fixation (rows). We also include the predicted termination probability $\tau$ for each step on the left. The model terminates searching if $\tau>0.5$.
  • ...and 8 more figures
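
The captions of Figures 3 and 5 together describe a sequential rollout: the peripheral tokens stay fixed, a foveal token is added at every new fixation, and the search stops once the termination probability $\tau$ exceeds 0.5. A minimal sketch of that loop follows. The `predict_step` callable is a hypothetical stand-in for the model, and choosing the heatmap argmax as the next fixation, checking $\tau$ before emitting a fixation, and the `max_steps` cap are all assumptions made for illustration rather than details confirmed by the captions.

```python
def argmax_2d(heatmap):
    """Return the (row, col) of the largest value in a 2D list."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best

def rollout(predict_step, peripheral_tokens, first_fixation,
            max_steps=6, tau_threshold=0.5):
    """Generate a scanpath by querying a step predictor repeatedly.

    predict_step(memory) -> (heatmap, tau) is a hypothetical interface:
    heatmap is a 2D list of scores, tau is the termination probability.
    """
    foveal_tokens = [first_fixation]   # grows by one token per fixation
    scanpath = [first_fixation]
    for _ in range(max_steps):
        # Peripheral tokens are reused unchanged; only foveal tokens grow.
        memory = {"peripheral": peripheral_tokens,
                  "foveal": list(foveal_tokens)}
        heatmap, tau = predict_step(memory)
        if tau > tau_threshold:        # stopping rule from Figure 5
            break
        fixation = argmax_2d(heatmap)  # assumption: greedy choice
        scanpath.append(fixation)
        foveal_tokens.append(fixation)
    return scanpath

def toy_predict_step(memory):
    """Toy predictor (not the real model): the peak moves with the step count."""
    n = len(memory["foveal"])
    heatmap = [[0.0] * 4 for _ in range(3)]
    heatmap[n % 3][n % 4] = 1.0
    tau = 0.9 if n >= 3 else 0.1
    return heatmap, tau

path = rollout(toy_predict_step, peripheral_tokens=["p0", "p1"],
               first_fixation=(0, 0))
```

Sampling the next fixation from the heatmap instead of taking the argmax would fit the same loop by replacing `argmax_2d`.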