DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images

Ozgur Kara, Harris Nisar, James M. Rehg

TL;DR

This paper introduces DiffEye, a diffusion-based framework for generating continuous, diverse eye movement trajectories conditioned on natural images. It directly uses raw eye-tracking trajectories rather than discretized scanpaths, enabling realistic inter-subject variability and the production of outputs that can be converted into scanpaths or saliency maps. Key contributions include a novel Corresponding Positional Embedding (CPE) that aligns gaze coordinates with patch-based image features, a high-resolution FeatUp conditioning pipeline, and an end-to-end diffusion model trained on MIT1003. DiffEye achieves state-of-the-art performance in scanpath generation, enables continuous trajectory synthesis for natural images, and demonstrates generalization to unseen data, with potential applications in developmental psychology and data augmentation for gaze modeling. Limitations include reliance on a relatively small dataset and fixed-length trajectories, with future work exploring transfer learning, data sharing, and variable-length sequence generation.

Abstract

Numerous models have been developed for scanpath and saliency prediction. They are typically trained on scanpaths, which model eye movement as a sequence of discrete fixation points connected by saccades, and the rich information contained in the raw trajectories is often discarded. Moreover, most existing approaches fail to capture the variability observed among human subjects viewing the same image. They generally predict a single scanpath of fixed, pre-defined length, which conflicts with the inherent diversity and stochastic nature of real-world visual attention. To address these challenges, we propose DiffEye, a diffusion-based training framework designed to model continuous and diverse eye movement trajectories during free viewing of natural images. Our method builds on a diffusion model conditioned on visual stimuli and introduces a novel component, namely Corresponding Positional Embedding (CPE), which aligns spatial gaze information with the patch-based semantic features of the visual input. By leveraging raw eye-tracking trajectories rather than relying on scanpaths, DiffEye captures the inherent variability in human gaze behavior and generates high-quality, realistic eye movement patterns, despite being trained on a comparatively small dataset. The generated trajectories can also be converted into scanpaths and saliency maps, resulting in outputs that more accurately reflect the distribution of human visual attention. DiffEye is the first method to tackle this task on natural images using a diffusion model while fully leveraging the richness of raw eye-tracking data. Our extensive evaluation shows that DiffEye not only achieves state-of-the-art performance in scanpath generation but also enables, for the first time, the generation of continuous eye movement trajectories. Project webpage: https://diff-eye.github.io/
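
To make the trajectory-to-saliency conversion mentioned in the abstract concrete, the following is a minimal sketch and not the authors' procedure. It assumes a trajectory is an array of (x, y) gaze samples in pixel coordinates and places an isotropic Gaussian on every raw sample, which is a simplification: a fixation-based pipeline would first detect fixations. The function name `trajectory_to_saliency` and the `sigma` value are hypothetical choices for illustration.

```python
import numpy as np

def trajectory_to_saliency(traj, height, width, sigma=25.0):
    """Turn (N, 2) gaze samples (x, y in pixels) into a normalized density map.

    Assumption: every raw sample contributes one Gaussian bump; this is an
    illustrative simplification, not the conversion used in the paper.
    """
    traj = np.asarray(traj, dtype=float)
    xs, ys = traj[:, 0], traj[:, 1]
    # Per-sample 1D Gaussian profiles along the rows and the columns.
    rows = np.arange(height)[:, None]                   # (H, 1)
    cols = np.arange(width)[:, None]                    # (W, 1)
    gy = np.exp(-0.5 * ((rows - ys[None, :]) / sigma) ** 2)  # (H, N)
    gx = np.exp(-0.5 * ((cols - xs[None, :]) / sigma) ** 2)  # (W, N)
    # A 2D isotropic Gaussian is separable, so summing the N bumps
    # reduces to a single matrix product.
    saliency = gy @ gx.T                                # (H, W)
    total = saliency.sum()
    # Normalize to a probability map; guard against an all-zero map.
    return saliency / total if total > 0 else saliency
```

Several trajectories generated for the same image could be concatenated along the first axis before calling the function, so that the resulting map reflects the spread across samples rather than a single viewing.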

Paper Structure

This paper contains 29 sections, 3 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of different eye-tracking data types. (a) Original visual stimulus. (b) Saliency maps highlight regions of interest but do not capture the temporal dynamics of human attention. (c) Scanpaths offer a compressed representation of eye movement trajectories. (d) Full eye movement trajectories, recorded via eye trackers, provide detailed insights into attention dynamics. This example is from the MIT1003 dataset (Judd et al., 2009); each color represents a different subject, emphasizing the importance of modeling inter-subject variability.
  • Figure 2: An illustration of DiffEye. (a) End-to-end Training. Given an initial trajectory $R^{(0)}$ and image $I$, noise is added to produce $R^{t_{\text{diff}}}$. FeatUp extracts patch features $F_{\text{patch}}$, and both inputs are passed to the (b) CPE module, which aligns trajectory and patch positions. The resulting representations $R^{\text{CPE}}$ and $F_{\text{CPE}}$ are processed by a U-Net with cross-attention at each block and optimized via diffusion loss. (c) Inference. Starting from noise, the model denoises for $T$ steps to generate an eye movement trajectory, which can be used to produce scanpaths or saliency maps.
  • Figure 3: Qualitative comparison of scanpath generation. Scanpaths generated by DiffEye and baseline models are shown alongside ground truth annotations across four different scenes. Each row represents a unique stimulus, and each column shows the corresponding scanpaths generated from a specific method.
  • Figure 4: Qualitative analysis and ablation study of continuous eye movement trajectory generation. (a) Multiple eye movement trajectories generated by DiffEye alongside ground truth annotations across four different scenes. (b) Ablation study showing the impact of removing individual architectural components (FeatUp, CPE, cross-attention, and patch-level features) on continuous trajectory generation.
  • Figure 5: Qualitative comparison of saliency map predictions. Saliency maps generated by DiffEye and baseline models are shown alongside ground truth maps for four different scenes. Each row corresponds to a different stimulus, with columns displaying the stimulus, ground truth, and predictions.
  • ...and 4 more figures
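
The noising step in the Figure 2 caption, where a clean trajectory $R^{(0)}$ is corrupted to obtain $R^{t_{\text{diff}}}$, can be sketched as follows. This assumes the standard DDPM forward process with a linear beta schedule; the schedule, the number of steps, and the helper names `make_alpha_bar` and `add_noise` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(r0, t, alpha_bar, rng):
    """Sample a noised trajectory at diffusion step t from the clean one.

    r0: (N, 2) clean trajectory; returns the noised trajectory and the noise,
    which is the usual regression target of a noise-prediction loss.
    """
    eps = rng.standard_normal(r0.shape)
    r_t = np.sqrt(alpha_bar[t]) * r0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return r_t, eps

# Usage: noise a dummy trajectory of 100 samples at a random diffusion step.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
r0 = np.full((100, 2), 0.5)           # hypothetical normalized gaze coordinates
t = int(rng.integers(0, len(alpha_bar)))
r_t, eps = add_noise(r0, t, alpha_bar, rng)
```

In a training loop, a conditional denoiser would receive `r_t`, the step `t`, and the image features, and would be optimized to recover `eps`; that network is omitted here.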