Unified Dynamic Scanpath Predictors Outperform Individually Trained Neural Models

Fares Abawi, Di Fu, Stefan Wermter

TL;DR

This work addresses the heterogeneity of human scanpaths by introducing a fixation history module to a GASP-based dynamic scanpath predictor, enabling a single unified model to predict multiple observers' gaze trajectories in social video stimuli. By combining social cues with a fixation-history channel and employing late integration (ARGMU/LARGMU variants), the approach achieves performance on par with or better than individually trained models while maintaining scalability. Key findings show that the unified model benefits from universal attention learned from group data, while fixation history injects personalized targeting, and that late integration offers robustness across longer prediction horizons and larger datasets. The results have practical implications for social human-robot interaction and cognitive simulations, where scalable personalization without per-observer models is desirable, though challenges such as non-determinism and cue reliability remain for future work.

Abstract

Previous research on scanpath prediction has mainly focused on group models, disregarding the fact that the scanpaths and attentional behaviors of individuals are diverse. The disregard of these differences is especially detrimental to social human-robot interaction, whereby robots commonly emulate human gaze based on heuristics or predefined patterns. However, human gaze patterns are heterogeneous and varying behaviors can significantly affect the outcomes of such human-robot interactions. To fill this gap, we adapted a deep learning-based social cue integration model, originally developed for saliency prediction, to instead predict scanpaths in videos. Our model learned scanpaths by recursively integrating fixation history and social cues through a gating mechanism and sequential attention. We evaluated our approach on gaze datasets of dynamic social scenes, observed under the free-viewing condition. The introduction of fixation history into our models makes it possible to train a single unified model rather than the resource-intensive approach of training individual models for each set of scanpaths. We observed that the late neural integration approach surpasses early fusion when training models on a large dataset, in comparison to a smaller dataset with a similar distribution. Results also indicate that a single unified model, trained on all the observers' scanpaths, performs on par with or better than individually trained models. We hypothesize that this outcome is a result of the group saliency representations instilling universal attention in the model, while the supervisory signal and fixation history guide it to learn personalized attentional behaviors, providing the unified model a benefit over individual models due to its implicit representation of universal attention.
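To make the idea of treating fixation history as one more modality concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the GASP gating mechanism or the ARGMU/LARGMU architecture: the function names `fixation_history_map` and `gated_integration`, the linear recency weighting, the Gaussian width `sigma`, and the fixed gate logits are all hypothetical choices made for illustration (in a trained model the gates would be learned).

```python
import numpy as np

def fixation_history_map(fixations, shape, sigma=1.5):
    """Render prior fixations (row, col) as a blurred map.

    Assumption: more recent fixations (later in the list) get a
    linearly larger weight. The paper does not specify this encoding.
    """
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    hist = np.zeros(shape, dtype=float)
    n = len(fixations)
    for i, (r, c) in enumerate(fixations):
        weight = (i + 1) / n
        hist += weight * np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
    peak = hist.max()
    return hist / peak if peak > 0 else hist

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_integration(modality_maps, gate_logits):
    """Weight each modality map by a sigmoid gate and average them
    into a single priority map (a toy stand-in for learned gating)."""
    maps = np.stack(modality_maps)                          # (M, H, W)
    gates = sigmoid(np.asarray(gate_logits, dtype=float))   # (M,)
    combined = np.tensordot(gates, maps, axes=1)            # (H, W)
    return combined / gates.sum()

# Hypothetical example: one group-level saliency map plus one observer's history.
shape = (8, 8)
saliency = np.full(shape, 0.1)
saliency[2, 2] = 1.0                                        # group-level hotspot
history = fixation_history_map([(6, 6), (5, 5)], shape)     # observer-specific input

# A high gate on the history pulls the predicted fixation toward this observer's past gaze.
priority = gated_integration([saliency, history], gate_logits=[0.0, 2.0])
next_fixation = np.unravel_index(np.argmax(priority), shape)
```

In this toy setup the same saliency map yields different predicted fixations depending on the history map and its gate, which is the intuition behind using a single unified model for several observers.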

Paper Structure (26 sections, 8 equations, 8 figures, 5 tables, 1 algorithm)

This paper contains 26 sections, 8 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Our modification of the GASP [abawi2021gasp] model, transforming it into a scanpath prediction model by including a fixation history module. The GASP model integrates the facial expression and gaze direction social cue representations with the fixation density maps of the DAVE [tavakoli2020deep] model. By additionally integrating the fixation history as another modality, we are able to specify the observer whose scanpath is to be inferred. We provide a sequence with a specific context size (number of input frames) consisting of prior fixations and spatiotemporal representations. In this example, we employ the late integration GASP variant (DAM + LARGMU, context size $T'=10$) predicting scanpaths of three observers (G1 - G3). The multi-step-ahead (five steps) fixation predictions (P1 - P3) indicate that even the slightest divergence from the ground truth at one timestep has a noticeable impact on future predictions.
  • Figure 2: Our two GASP [abawi2021gasp] variants extended with fixation history modules for predicting scanpaths, where (a) is the modality fusion variant ARGMU, and (b) is the non-fusion late integration model LARGMU. The directed attention module (DAM) is applied to each variant with the fixation density maps for the entire sequence as ground truth during training. $T'$ represents the context size (number of input frames) for each model, whereas $t'$ indicates the current timestep (frame index) in the video. $\mathbf{\hat{m}}^{\langle t' \rangle}$ represents the priority map predicted by the model at timestep $t'$. SP: Saliency Prediction Representation; GE: Gaze Direction Estimation Representation; FER: Facial Expression Recognition Representation.
  • Figure 3: The individual model 1 vs 1 and 1 vs infinity evaluations on the FindWho [xu2018findwho] and MVVA [liu2020mvva] datasets, across the two GASP [abawi2021gasp] variants extended with fixation history modules. (a,d) visualize the mean values of the scores across all samples. $*\!*$ denotes $.001 < p < .01$ and $*\!*\!*$ denotes $p < .001$.
  • Figure 4: The individual model 1 vs 1 and 1 vs infinity evaluations on the FindWho [xu2018findwho] and MVVA [liu2020mvva] datasets, across the two GASP [abawi2021gasp] variants extended with fixation history modules. (a,d) visualize the standard deviation of the scores across all testing videos per individual observer. $*$ denotes $.01 < p < .05$, $*\!*$ denotes $.001 < p < .01$, $*\!*\!*$ denotes $p < .001$, and n.s. denotes no significance.
  • Figure 5: The individual model 1 vs 1 and 1 vs infinity evaluations on the FindWho [xu2018findwho] dataset, across the two GASP [abawi2021gasp] variants extended with fixation history modules. The AUCJ scores are measured for the (a) integration and (b) fusion architectures, as well as NSS scores for the (c) integration and (d) fusion architectures. $*\!*\!*$ denotes $p < .001$.
  • ...and 3 more figures
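
The captions above report NSS scores and mark significance with asterisks. As a rough illustration, the sketch below computes NSS using its standard definition (the mean of the z-scored map at fixated pixels) and maps a p-value to the asterisk notation used in the captions. The function names `nss` and `significance_label`, the handling of constant maps, and the treatment of p-values that fall exactly on a threshold are assumptions; the paper's own evaluation pipeline (including the 1 vs 1 and 1 vs infinity protocols and AUCJ) is not reproduced here.

```python
import numpy as np

def nss(priority_map, fixations):
    """Normalized Scanpath Saliency: mean z-scored map value at fixated pixels.

    `fixations` is a list of (row, col) pixel coordinates.
    """
    m = np.asarray(priority_map, dtype=float)
    std = m.std()
    if std == 0 or len(fixations) == 0:
        return 0.0  # assumption: undefined cases are reported as 0
    z = (m - m.mean()) / std
    rows, cols = zip(*fixations)
    return float(z[list(rows), list(cols)].mean())

def significance_label(p):
    """Map a p-value to the caption notation.

    Assumption: a p-value exactly on a threshold falls into the
    less significant bin; the captions only give open intervals.
    """
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "n.s."

# Hypothetical example: a map peaked at the fixated pixel scores above zero,
# while a fixation on a low-priority pixel scores below zero.
demo_map = np.zeros((5, 5))
demo_map[2, 2] = 1.0
score_hit = nss(demo_map, [(2, 2)])
score_miss = nss(demo_map, [(0, 0)])
```

A positive NSS means the fixations landed on above-average regions of the predicted map, and a value near zero corresponds to chance-level agreement.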