DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images

Chuhan Jiao, Yao Wang, Guanhua Zhang, Mihai Bâce, Zhiming Hu, Andreas Bulling

TL;DR

DiffGaze is a novel method for generating realistic and diverse continuous human gaze sequences on 360° images. It is based on a conditional score-based denoising diffusion model and outperforms state-of-the-art methods on all tasks on both benchmarks.

Abstract

We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360° images based on a conditional score-based denoising diffusion model. Generating human gaze on 360° images is important for various human-computer interaction and computer graphics applications, e.g. for creating large-scale eye tracking datasets or for realistic animation of virtual humans. However, existing methods are limited to predicting discrete fixation sequences or aggregated saliency maps, thereby neglecting crucial parts of natural gaze behaviour. Our method uses features extracted from 360° images as condition and uses two transformers to model the temporal and spatial dependencies of continuous human gaze. We evaluate DiffGaze on two 360° image benchmarks for gaze sequence generation as well as scanpath prediction and saliency prediction. Our evaluations show that DiffGaze outperforms state-of-the-art methods on all tasks on both benchmarks. We also report a 21-participant user study showing that our method generates gaze sequences that are indistinguishable from real human sequences.

Paper Structure (23 sections, 10 equations, 5 figures, 4 tables)

This paper contains 23 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of our proposed DiffGaze method. We cast continuous gaze sequence generation as a conditional diffusion task. Our model is trained to recover the original gaze trajectory from the corrupted, noisy data. The condition that guides this diffusion process includes a spherical convolution for the 360° image and side information (time and feature embedding). We apply two Transformers to learn both the temporal and spatial attention. Please refer to the text for details about the architecture, diffusion process and the loss function.
  • Figure 2: Qualitative comparison of continuous gaze data generation models in four scenes. From left to right: gaze samples from a human observer, generated 30 Hz eye movement sequences from the ScanGAN360 method, ScanDMM, and our proposed model. From top to bottom: the Room and the Robots from the Sitzmann dataset, the Museum and Resort from the Salient360! dataset.
  • Figure 3: User ratings of the realism of gaze sequences generated by DiffGaze, ScanDMM, and ScanGAN360 (1: highly unrealistic, 10: highly realistic).
  • Figure 4: Qualitative comparison to scanpath prediction models in four scenes. From left to right: scanpaths obtained by a human observer, generated 30 Hz scanpaths obtained by ScanGAN360, ScanDMM, and the proposed model. From top to bottom: the Room and the Square from the Sitzmann dataset, the Museum and Autumn from the Salient360! dataset.
  • Figure 5: Qualitative comparison to saliency prediction models in four scenes. From left to right: scanpaths obtained by a human observer, generated saliency maps obtained by ScanGAN360, ScanDMM, and the proposed model. From top to bottom: the Room and the Robots from the Sitzmann dataset, the Mall and Gallery from the Salient360! dataset.
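
The overview caption describes training a model to recover a gaze trajectory from corrupted, noisy data. The following is a minimal sketch of that corruption step under stated assumptions, not the authors' implementation. It assumes a standard DDPM-style forward process with a linear noise schedule and a noise-prediction mean-squared-error loss; the schedule values, the two-dimensional (longitude, latitude) gaze encoding, and all function names are hypothetical, and the actual loss and architecture are given in the paper text rather than here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def add_noise(gaze, alpha_bar_t, rng):
    """Corrupt a gaze sequence: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    noise = [[rng.gauss(0.0, 1.0) for _ in point] for point in gaze]
    noisy = [
        [scale_signal * v + scale_noise * e for v, e in zip(point, eps)]
        for point, eps in zip(gaze, noise)
    ]
    return noisy, noise


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise (assumed objective)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(predicted_noise, true_noise):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# Hypothetical toy gaze sequence: 30 samples of (longitude, latitude) in [-1, 1].
gaze = [[0.01 * i, 0.0] for i in range(30)]
betas = linear_beta_schedule(50)
alpha_bars = cumulative_alphas(betas)
# Corrupt the sequence at the final (noisiest) step of the assumed schedule.
noisy, noise = add_noise(gaze, alpha_bars[-1], rng)
```

In a full pipeline, a conditional network would take `noisy`, the diffusion step, and the image features, and its output would be compared against `noise` with `denoising_loss`. That network is omitted here because its details are not part of this summary.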