EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement Learning

Yue Jiang, Zixin Guo, Hamed Rezazadegan Tavakoli, Luis A. Leiva, Antti Oulasvirta

TL;DR

EyeFormer uses a Transformer as the policy network of a deep reinforcement learning algorithm that controls gaze locations. It predicts full scanpath information, including fixation positions and durations, across individuals and various stimulus types.

Abstract

From a visual perception perspective, modern graphical user interfaces (GUIs) comprise a complex, graphics-rich, two-dimensional visuospatial arrangement of text, images, and interactive objects such as buttons and menus. While existing models can accurately predict regions and objects that are likely to attract attention "on average", so far there is no scanpath model capable of predicting scanpaths for an individual. To close this gap, we introduce EyeFormer, which leverages a Transformer architecture as a policy network to guide a deep reinforcement learning algorithm that controls gaze locations. Our model has the unique capability of producing personalized predictions when given a few user scanpath samples. It can predict full scanpath information, including fixation positions and durations, across individuals and various stimulus types. Additionally, we demonstrate applications in GUI layout optimization driven by our model. Our software and models will be publicly available.
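The idea of a policy trained by reinforcement learning to choose gaze locations can be illustrated with a toy policy-gradient loop. This is a minimal sketch, not the paper's implementation: the Transformer policy is replaced by a plain table of logits per scanpath step, the image is reduced to a small grid of candidate cells, the reward is assumed to be the negative Euclidean distance to a ground-truth fixation, and fixation durations are left out. All names (`softmax`, `reward`, `train`, `predict`) and all hyperparameters are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution over grid cells."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(cell, target, grid_w):
    """Assumed reward: negative distance from the chosen cell to the ground truth."""
    cx, cy = cell % grid_w, cell // grid_w
    tx, ty = target
    return -math.hypot(cx - tx, cy - ty)


def train(targets, grid_w=4, grid_h=4, steps=3000, lr=0.1, seed=0):
    """REINFORCE-style updates for a tabular stand-in for the policy network.

    `targets` is a list of ground-truth (x, y) grid cells, one per scanpath step.
    """
    rng = random.Random(seed)
    n_cells = grid_w * grid_h
    # One logit vector per scanpath step (stand-in for the Transformer policy).
    logits = [[0.0] * n_cells for _ in targets]
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        for t, target in enumerate(targets):
            probs = softmax(logits[t])
            # Training: sample the next point from the policy's distribution.
            cell = rng.choices(range(n_cells), weights=probs)[0]
            r = reward(cell, target, grid_w)
            advantage = r - baseline
            baseline += 0.01 * (r - baseline)
            # Gradient of log softmax w.r.t. each logit: one_hot(cell) - probs.
            for k in range(n_cells):
                grad = (1.0 if k == cell else 0.0) - probs[k]
                logits[t][k] += lr * advantage * grad
    return logits


def predict(logits, grid_w=4):
    """Testing: use the policy directly, taking the most likely cell per step."""
    path = []
    for step_logits in logits:
        best = max(range(len(step_logits)), key=step_logits.__getitem__)
        path.append((best % grid_w, best // grid_w))
    return path
```

Calling `predict(train([(0, 0), (3, 1), (2, 3)]))` returns one grid cell per scanpath step. The split between sampling during training and direct generation at test time follows the Figure 2 caption; everything else is a simplification.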

Paper Structure (40 sections, 11 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: Adapting the inhibition-of-return area to the display when computing the salient-value reward. The radius of the inhibition area is adjusted to account for the difference between the size of the saliency map and the size of the image on the display. a) The diameter of the inhibition areas on the display, $m_{\textrm{display}}$, is commensurate with the human visual angle. b) We then compute $m_{\textrm{orig}}$, the diameter of the corresponding inhibition areas for the input image of size $w_\mathcal{I} \times h_\mathcal{I}$. c) The image is resized to $w_\textrm{inp} \times h_\textrm{inp}$, the input size required by the model, so the inhibition areas are rendered as ellipses with radii $m_w$ and $m_h$.
  • Figure 2: Overview of our Transformer-guided reinforcement learning framework for scanpath prediction. It comprises several components: the environment, which produces the state (the input image and previous fixation points); the Transformer model, which provides the policy; the policy-generated action, which predicts the next point in the scanpath; and the reward function (obtained by evaluating the action against ground truth), through which the policy is updated. Within the Transformer policy model, the input image is resized and split into patches, which are fed to the vision encoder to obtain the image embedding; the viewer encoder generates the viewer embedding that distinguishes between viewers (only for individual-level prediction); and the fixation decoder takes the image and viewer embeddings, along with previously generated fixations, to sequentially generate the following points along the scanpath. During training, the model first samples the next point from the distribution that the policy generates for the current state. The sampled point is then used to update the state of the environment, and the reward computed against ground truth is used to update the Transformer policy model. During testing, we directly use the policy model to generate the scanpaths.
  • Figure 3: Scanpaths personalized for two viewers, illustrating our model's ability to generate these by means of only a few scanpath samples from each viewer (note that "Viewer 1" and "Viewer 2" are generic terms; the viewers are not the same across all examples). More examples are presented in Supplementary Materials.
  • Figure 4: Our population-level scanpath predictions are close to the ground truth (GT) in fixation positions, ordering, and duration. More examples are presented in Supplementary Materials.
  • Figure 5: Annotated comparison between different models. This illustration presents the best baseline methods, with annotated limitations.
  • ...and 2 more figures
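
The rescaling described in the Figure 1 caption can be made concrete with a short sketch. The caption does not give the formulas, so the ones here are assumptions: `display_scale` (display pixels per original-image pixel) is a hypothetical parameter, the diameter is assumed to scale linearly with image size, and each radius is assumed to be half the rescaled diameter. The helper names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InhibitionEllipse:
    radius_w: float  # m_w, horizontal radius in model-input pixels
    radius_h: float  # m_h, vertical radius in model-input pixels


def inhibition_radii(m_display, display_scale, image_size, input_size):
    """Map an inhibition diameter on the display to ellipse radii at model input size.

    m_display     -- diameter on the display, in display pixels
    display_scale -- assumed: display pixels per original-image pixel
    image_size    -- (w_I, h_I), original image size
    input_size    -- (w_inp, h_inp), size the model expects
    """
    w_img, h_img = image_size
    w_inp, h_inp = input_size
    # Step b: diameter expressed in original-image pixels (assumed linear scaling).
    m_orig = m_display / display_scale
    # Step c: resizing is anisotropic, so a circle becomes an ellipse.
    m_w = 0.5 * m_orig * w_inp / w_img
    m_h = 0.5 * m_orig * h_inp / h_img
    return InhibitionEllipse(m_w, m_h)


def is_inhibited(point, previous_fixation, ellipse):
    """True if `point` falls inside the inhibition ellipse around a past fixation."""
    dx = (point[0] - previous_fixation[0]) / ellipse.radius_w
    dy = (point[1] - previous_fixation[1]) / ellipse.radius_h
    return dx * dx + dy * dy <= 1.0
```

For instance, with a 1920 × 1080 image shown at half size (`display_scale=0.5`), a 100-pixel display diameter, and a 384 × 384 model input, this sketch gives a horizontal radius of 20 pixels and a larger vertical radius, because the vertical axis is compressed less by the resize.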