
Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes

Luke Palmer, Petar Palasek, Hazem Abdelkawy

Abstract

Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.
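As a worked sketch of the next-step formulation just described (our notation: the gaze position $\mathbf{g}_t$, a history-and-scene encoding $\mathbf{h}_t$, and the number of components $K$ are illustrative assumptions, not symbols taken from the paper), the Object Density Network's mixture output and training objective might take the form

$$p(\mathbf{g}_{t+1} \mid \mathbf{h}_t) = \sum_{k=1}^{K} \pi_k(\mathbf{h}_t)\, \mathcal{N}\big(\mathbf{g}_{t+1};\, \boldsymbol{\mu}_k(\mathbf{h}_t),\, \boldsymbol{\Sigma}_k(\mathbf{h}_t)\big), \qquad \mathcal{L}_t = -\log p(\mathbf{g}^{\star}_{t+1} \mid \mathbf{h}_t),$$

where $\mathbf{g}^{\star}_{t+1}$ is the observed ground-truth gaze. At inference, sampling from this mixture rather than taking its mode is what preserves the stochasticity of attentional shifts during rollout.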

Paper Structure

This paper contains 73 sections, 13 equations, 16 figures, and 4 tables.

Figures (16)

  • Figure 1: To model driver attention as part of a dynamical system, we encode traffic scenes as heterogeneous scene graphs with nodes corresponding to road structure, driving-relevant objects, and egocentric gaze, representing the driver's foveated field of view. The Affinity Relation Transformer processes these graphs to predict next-step gaze probability distributions. Our dynamical-systems approach generates state-of-the-art gaze time series, scanpaths, and saliency maps from a single model.
  • Figure 2: Using upstream perception modules and the observed gaze position, synchronised driving video and gaze are converted into a spatiotemporal heterogeneous scene graph with nodes for traffic agents, road structure, and the driver's foveal view. Each node is assigned a feature vector including appearance and depth, while edges represent the spatiotemporal differences and appearance similarities between nodes. Scene graphs are processed by Affinity Relation Transformer (ART) blocks before an Object Density Network (ODN) predicts a Gaussian-mixture distribution for the next gaze position. We train the model with the negative log-likelihood of ground-truth gaze under the predicted mixture. For simulation we employ autoregressive rollout by sampling from the mixture, updating the graph with the sampled position, and repeating (see the rollout sketch after this list); simulated gaze sequences can then be post-processed into scanpaths and saliency maps without additional training.
  • Figure 3: ART computes messages for each pair of connected source and destination nodes $\mathbf{v}_j$ and $\mathbf{v}_i$ in the input graph, incorporating their relative affinity $\mathbf{a}_{i, j}$ into each message (see the message-passing sketch after this list). Messages are aggregated into updated destination node vectors, $\tilde{\mathbf{x}}'_i$. Our novel relative affinity embeddings are highlighted in red.
  • Figure 4: Violin plots of per-frame vehicle (left) and pedestrian (right) counts in the Focus100, MAAD, and DR(eye)VE datasets.
  • Figure 5: Gaze sequences and saliency maps generated on a 15 s clip of Focus100. The first column shows human gaze sequences, followed by those generated by the ART, SCOUT, ViNet, and GLC models. Each trace represents a single simulation, with the y-axis indicating time and the x-axis showing left-to-right gaze position; blue marks to the left show detected fixations, and the average fixation duration (FD) per method is given. On the right, we display observed fixations for humans and model-generated saliency maps for the same video frames, temporally aligned with the gaze sequences for direct comparison. See the Supplementary for further examples.
  • ...and 11 more figures
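
Figure 2 describes simulation as an autoregressive loop: predict a mixture, sample a gaze position, write it back into the scene graph, and repeat. The following is a minimal sketch of that loop, assuming hypothetical `model` and `build_graph` callables and a (weights, means, covariances) mixture parameterisation; none of these names are taken from the paper's implementation.

```python
import numpy as np

def sample_gmm(weights, means, covs, rng):
    """Draw one 2-D sample from a Gaussian mixture (illustrative parameterisation)."""
    k = rng.choice(len(weights), p=weights)          # pick a mixture component
    return rng.multivariate_normal(means[k], covs[k])

def rollout(model, build_graph, frames, gaze_init, seed=0):
    """Autoregressive gaze rollout in the spirit of Figure 2.

    `model` maps a scene graph to mixture parameters (standing in for the
    ART + ODN stack); `build_graph` builds the gaze-centric heterogeneous
    graph for one frame. Both are stand-ins, not the paper's interfaces.
    """
    rng = np.random.default_rng(seed)
    gaze = list(gaze_init)                           # observed gaze history
    for frame in frames:
        graph = build_graph(frame, gaze)             # condition on scene + gaze history
        weights, means, covs = model(graph)          # next-step gaze distribution
        gaze.append(sample_gmm(weights, means, covs, rng))
    return np.asarray(gaze)                          # post-process into scanpaths / saliency
```

Sampling at each step, rather than taking the mixture mode, is what lets repeated rollouts produce the distinct single-simulation traces shown in Figure 5.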
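
Figure 3's affinity-conditioned messages can likewise be sketched. One plausible realisation, shown below, embeds the relative affinity $\mathbf{a}_{i,j}$ and adds it as a bias to the attention logit between destination node $i$ and source node $j$; all weight matrices and shapes here are assumptions for illustration, and the paper's exact mechanism may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def art_messages(x_dst, x_src, affinity, Wq, Wk, Wv, w_a):
    """Affinity-biased attention update in the spirit of Figure 3.

    x_dst: (N, d) destination nodes; x_src: (M, d) source nodes;
    affinity: (N, M, d_a) relative affinities a_{i,j} (e.g. spatiotemporal
    differences and appearance similarities); w_a: (d_a,) affinity embedding.
    All parameter names and shapes are illustrative.
    """
    q, k, v = x_dst @ Wq, x_src @ Wk, x_src @ Wv
    bias = affinity @ w_a                            # (N, M) scalar bias per edge
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias   # affinity enters every message
    attn = softmax(logits, axis=-1)                  # weights over connected sources
    return x_dst + attn @ v                          # aggregate into updated destination vectors
```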