Quantitative and Qualitative Comparison of Generative Models for Subject-Specific Gaze Synthesis: Diffusion vs GAN
Kamrul Hasan, Dmytro Katrychuk, Mehedi Hasan Raju, Oleg V. Komogortsev
TL;DR
This study addresses the need for subject-specific gaze data by comparing diffusion-based DiffEyeSyn and GAN-based SP-EyeGAN, each enhanced with compact, subject-aware conditioning. DiffEyeSyn uses a downsampled identity-removal pipeline plus a 128-dimensional user embedding to guide generation, while SP-EyeGAN integrates a Subject-Specific Condition Generator combining DDQFE features with one-hot identity. Across experiments on the GazeBase dataset, DiffEyeSyn yields higher synthetic-real similarity and substantially lower spatial errors and jitter than SP-EyeGAN, indicating superior capture of individual gaze dynamics. The results support the feasibility of privacy-conscious, subject-aware gaze synthesis for AR/VR and related applications, while also outlining directions for further improving temporal fidelity and conditioning strategies.
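As a rough illustration of the conditioning idea described for SP-EyeGAN, the sketch below builds a per-subject condition vector by appending a one-hot identity code to a feature vector. This is an assumption-laden sketch, not the authors' implementation: the summary only says the two are combined, so plain concatenation is assumed, the function names `one_hot` and `build_condition` are hypothetical, and the feature values are placeholders rather than real DDQFE outputs.

```python
def one_hot(subject_index, num_subjects):
    """Return a one-hot identity code of length num_subjects."""
    if not 0 <= subject_index < num_subjects:
        raise ValueError("subject_index out of range")
    return [1.0 if i == subject_index else 0.0 for i in range(num_subjects)]


def build_condition(features, subject_index, num_subjects):
    """Concatenate a feature vector with a one-hot subject code.

    Assumption: concatenation is one simple way to combine the two parts;
    the source does not specify the exact combination.
    """
    return list(features) + one_hot(subject_index, num_subjects)


# Placeholder feature values (hypothetical, not real DDQFE output).
features = [0.12, -0.4, 0.9]
condition = build_condition(features, subject_index=2, num_subjects=4)
```

A generator would then receive `condition` alongside its noise input, so that samples for different subjects are drawn under different conditions.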
Abstract
Recent advances in deep learning demonstrate the ability to generate synthetic gaze data. However, most approaches generate data from random noise distributions or global, predefined latent embeddings, whereas individualized gaze sequence generation has been less explored. To address this gap, we revisit two recent approaches, one based on diffusion and one on generative adversarial networks (GANs), and introduce modifications that make both models explicitly subject-aware while improving accuracy and effectiveness. For the diffusion-based approach, we use compact user embeddings that emphasize per-subject traits. For the GAN-based approach, we propose a subject-specific synthesis module that conditions the generator to better retain idiosyncratic gaze information. Finally, we conduct a comprehensive assessment of the modified approaches using standard eye-tracking signal quality metrics, including spatial accuracy and precision. This work helps define synthetic signal quality, realism, and subject specificity, thereby contributing to the development of gaze-based applications.
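For readers unfamiliar with the signal quality metrics named above, the following minimal sketch shows one common way to compute them for a single fixation. It assumes gaze samples and the target are given as (x, y) positions in degrees of visual angle, takes spatial accuracy as the mean offset from the target, and takes precision as the root mean square of sample-to-sample distances. These are widely used conventions, not necessarily the exact definitions used in this work, and the function names and sample values are hypothetical.

```python
import math


def spatial_accuracy(gaze, target):
    """Mean Euclidean offset between each gaze sample and the target."""
    tx, ty = target
    return sum(math.hypot(x - tx, y - ty) for x, y in gaze) / len(gaze)


def precision_rms_s2s(gaze):
    """Root mean square of distances between consecutive gaze samples."""
    squared = [
        (x2 - x1) ** 2 + (y2 - y1) ** 2
        for (x1, y1), (x2, y2) in zip(gaze, gaze[1:])
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical fixation samples around a target at the origin.
fixation = [(0.5, 0.0), (0.5, 0.2), (0.5, 0.0), (0.5, 0.2)]
accuracy = spatial_accuracy(fixation, (0.0, 0.0))
precision = precision_rms_s2s(fixation)
```

Lower values are better for both: a low accuracy value means the samples sit close to the target, and a low precision value means little sample-to-sample jitter.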
