Quantitative and Qualitative Comparison of Generative Models for Subject-Specific Gaze Synthesis: Diffusion vs GAN

Kamrul Hasan, Dmytro Katrychuk, Mehedi Hasan Raju, Oleg V. Komogortsev

TL;DR

This study addresses the need for subject-specific gaze data by comparing diffusion-based DiffEyeSyn and GAN-based SP-EyeGAN, each enhanced with compact, subject-aware conditioning. DiffEyeSyn uses a downsampled identity-removal pipeline plus a 128-dimensional user embedding to guide generation, while SP-EyeGAN integrates a Subject-Specific Condition Generator combining DDQFE features with one-hot identity. Across experiments on the GazeBase dataset, DiffEyeSyn yields higher synthetic-real similarity and substantially lower spatial errors and jitter than SP-EyeGAN, indicating superior capture of individual gaze dynamics. The results support the feasibility of privacy-conscious, subject-aware gaze synthesis for AR/VR and related applications, while also outlining directions for further improving temporal fidelity and conditioning strategies.

Abstract

Recent advances in deep learning demonstrate the ability to generate synthetic gaze data. However, most approaches have focused on generating data from random noise distributions or global, predefined latent embeddings, whereas individualized gaze sequence generation has been less explored. To address this gap, we revisit two recent approaches based on diffusion and generative adversarial networks (GANs) and introduce modifications that make both models explicitly subject-aware while improving accuracy and effectiveness. For the diffusion-based approach, we utilize compact user embeddings that emphasize per-subject traits. For the GAN-based approach, we propose a subject-specific synthesis module that conditions the generator to better retain idiosyncratic gaze information. Finally, we conduct a comprehensive assessment of these modified approaches using standard eye-tracking signal quality metrics, including spatial accuracy and precision. This work helps define synthetic signal quality, realism, and subject specificity, thereby contributing to the potential development of gaze-based applications.

Paper Structure

This paper contains 27 sections, 10 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: (a) Overview of the updated DiffEyeSyn architecture: The original positional signal $p$ is processed through the Identity Removal module and an SGDF filter to produce an identity-removed velocity $v_0$, while using only the SGDF yields the raw velocity $v$. At each diffusion step $t$, noise $x_t$ and $v_0$ serve as input to the diffusion model, which is conditioned on user embeddings from the pre-trained Eye Know You Too encoder. The model predicts noise ($\hat{\varepsilon}_t$), which is then converted into the predicted velocity $\hat{v}$. (b) Overview of the updated SP-EyeGAN architecture: The Subject-Specific Condition Generator (SCG), composed of a DDQFE and a one-hot (OH) encoder, extracts conditional features from the original velocity sequences and metadata. The generator uses random noise and these features to produce synthetic eye movements (fixations or saccades), depending on whether the FixGAN or SacGAN model is used.
  • Figure 2: Qualitative comparison between SP-EyeGAN and DiffEyeSyn for two different tasks: (a) TEX and (b) RAN. For each task, the first row contains the positional signal, and the second row contains the velocity signal.
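
The conditioning path described for SP-EyeGAN in Figure 1(b) can be illustrated with a minimal sketch. It assumes the Subject-Specific Condition Generator simply concatenates a feature vector with a one-hot subject identity, and that the generator input is the concatenation of random noise and that condition. The caption does not specify either step, so both are assumptions. The function `placeholder_features` is a hypothetical stand-in for DDQFE (whose actual features are not given here), and all names below are illustrative rather than taken from the paper.

```python
import random
from typing import List, Sequence


def one_hot(subject_index: int, num_subjects: int) -> List[float]:
    """Encode a subject identity as a one-hot vector."""
    if not 0 <= subject_index < num_subjects:
        raise ValueError("subject_index must be in [0, num_subjects)")
    vec = [0.0] * num_subjects
    vec[subject_index] = 1.0
    return vec


def placeholder_features(velocity: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for DDQFE: mean and peak absolute velocity.

    The real DDQFE features are not described in this section.
    """
    if not velocity:
        raise ValueError("velocity sequence must be non-empty")
    magnitudes = [abs(v) for v in velocity]
    return [sum(magnitudes) / len(magnitudes), max(magnitudes)]


def build_condition(
    velocity: Sequence[float], subject_index: int, num_subjects: int
) -> List[float]:
    """Assumed SCG output: signal features followed by one-hot identity."""
    return placeholder_features(velocity) + one_hot(subject_index, num_subjects)


def generator_input(
    condition: Sequence[float], noise_dim: int, rng: random.Random
) -> List[float]:
    """Assumed generator input: Gaussian noise followed by the condition."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + list(condition)


if __name__ == "__main__":
    cond = build_condition([1.0, -3.0, 2.0], subject_index=1, num_subjects=4)
    z = generator_input(cond, noise_dim=8, rng=random.Random(0))
    print(len(cond), len(z))
```

In this sketch the same condition vector would be passed to either generator (FixGAN or SacGAN); only the noise changes between samples, so repeated draws for one subject share the identity part of the input.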