ScanGAN360: A Generative Model of Realistic Scanpaths for 360$^{\circ}$ Images

Daniel Martin, Ana Serrano, Alexander W. Bergman, Gordon Wetzstein, Belen Masia

TL;DR

ScanGAN360 presents a conditional GAN for generating realistic 360° gaze scanpaths by combining a sphere-aware 3D scanpath parameterization with a spherical DTW-based loss. A two-branch architecture uses panoramic convolutions and CoordConv to accommodate 360° distortions, enabling generation of long, diverse scanpaths (up to 30 seconds) at high speed (~$10^3$ scanpaths/s) without ground-truth one-to-one mappings. The approach outperforms prior 360° scanpath methods and closely approaches human baselines across multiple datasets, with thorough ablations validating the DTW_sph loss and architectural choices. Behavioral analyses show realistic exploration dynamics, equator bias, and inter-observer congruency, supporting practical deployment in VR content design, scanpath-driven thumbnails, and avatar gaze. The work also discusses limitations (fixed length, sampling rate) and outlines future directions for variable-length trajectories and low-level oculomotor dynamics, with code and models released for further research.
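To illustrate why a sphere-aware 3D parameterization helps, the following minimal sketch converts gaze points from latitude/longitude to unit vectors and back. The function names and the axis convention (z pointing to the north pole, longitude measured in the x-y plane) are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np

def latlon_to_xyz(lat, lon):
    """Map latitude/longitude (radians) to points on the unit sphere.

    Assumed convention (hypothetical): z is the polar axis,
    longitude is measured from the x axis in the x-y plane.
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)

def xyz_to_latlon(points):
    """Inverse mapping; inputs are re-normalized to unit length first."""
    points = np.asarray(points, dtype=float)
    points = points / np.linalg.norm(points, axis=-1, keepdims=True)
    lat = np.arcsin(np.clip(points[..., 2], -1.0, 1.0))
    lon = np.arctan2(points[..., 1], points[..., 0])
    return lat, lon

# Two gaze points on either side of the longitude wrap-around:
# 358 degrees apart as raw longitudes, but neighbors on the sphere.
wrap_pts = latlon_to_xyz(np.radians([0.0, 0.0]), np.radians([179.0, -179.0]))
wrap_gap = np.linalg.norm(wrap_pts[0] - wrap_pts[1])
```

In this representation the two points across the ±180° seam end up close together, whereas their raw longitude values differ by almost a full turn. This is the kind of discontinuity a 3D parameterization avoids.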

Abstract

Understanding and modeling the dynamics of human gaze behavior in 360$^\circ$ environments is a key challenge in computer vision and virtual reality. Generative adversarial approaches could alleviate this challenge by generating a large number of possible scanpaths for unseen images. Existing methods for scanpath generation, however, do not adequately predict realistic scanpaths for 360$^\circ$ images. We present ScanGAN360, a new generative adversarial approach to address this challenging problem. Our network generator is tailored to the specifics of 360$^\circ$ images representing immersive environments. Specifically, we accomplish this by leveraging the use of a spherical adaptation of dynamic-time warping as a loss function and proposing a novel parameterization of 360$^\circ$ scanpaths. The quality of our scanpaths outperforms competing approaches by a large margin and is almost on par with the human baseline. ScanGAN360 thus allows fast simulation of large numbers of virtual observers, whose behavior mimics real users, enabling a better understanding of gaze behavior and novel applications in virtual scene design.
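The idea of a spherical adaptation of dynamic time warping can be sketched as classic DTW with the per-point cost replaced by the great-circle (angular) distance between gaze directions. The code below is a plain, non-differentiable illustration under that assumption; the names `great_circle` and `dtw_sph` are hypothetical, and a training loss would need a differentiable relaxation that this sketch does not provide.

```python
import numpy as np

def great_circle(p, q):
    """Pairwise angular distance (radians) between unit vectors.

    p has shape (n, 3), q has shape (m, 3); the result has shape (n, m).
    """
    cos_angle = np.clip(p @ q.T, -1.0, 1.0)
    return np.arccos(cos_angle)

def dtw_sph(a, b):
    """Classic DTW between two scanpaths given as unit vectors,
    using great-circle distance as the local cost (illustrative only)."""
    cost = great_circle(a, b)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m]

# Toy scanpaths: the second dwells longer on the first point, which
# DTW absorbs through time warping instead of penalizing it.
path_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
path_b = path_a[[0, 0, 1, 2]]
same_shape_cost = dtw_sph(path_a, path_b)

# Antipodal single-point scanpaths are half a great circle apart.
antipodal_cost = dtw_sph(np.array([[1.0, 0.0, 0.0]]),
                         np.array([[-1.0, 0.0, 0.0]]))
```

Because the alignment is found by warping in time, a generated scanpath does not need a point-by-point correspondence with any single captured scanpath to receive a low cost.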

Paper Structure

This paper contains 34 sections, 10 equations, 56 figures, 7 tables.

Figures (56)

  • Figure 1: We present ScanGAN360, a generative adversarial approach to scanpath generation for 360$^{\circ}$ images. ScanGAN360 generates realistic scanpaths (bottom rows), outperforming state-of-the-art methods and mimicking the human baseline (top row).
  • Figure 2: Illustration of our generator and discriminator networks. Both networks have a two-branch structure: Features extracted from the 360$^{\circ}$ image with the aid of a CoordConv layer and an encoder-like network are concatenated with the input vector for further processing. The generator learns to transform this input vector, conditioned by the image, into a plausible scanpath. The discriminator takes as input vector a scanpath (either captured or synthesized by the generator), as well as the corresponding image, and determines the probability of this scanpath being real (or fake). We train them end-to-end in an adversarial manner, following a conditional GAN scheme. Please refer to the text for details on the loss functions and architecture.
  • Figure 3: Results of our model for two different scenes: market and mall from Rai et al.'s dataset [rai2017dataset]. From left to right: 360$^\circ$ image, ground truth sample scanpath, and three scanpaths generated by our model. The generated scanpaths are plausible and focus on relevant parts of the scene, yet they exhibit the diversity expected among different human observers. Please refer to the supplementary material for a larger set of results.
  • Figure 4: Qualitative comparison to previous methods for five different scenes from Rai et al.'s dataset. In each row, from left to right: 360$^\circ$ image, and a sample scanpath obtained with our method, PathGAN [assens2018pathgan], SaltiNet [assens2018scanpath], and Zhu et al.'s method [zhu2018prediction]. Note that, in the case of PathGAN, we are including the results directly taken from their paper, thus the different visualization. Our method produces plausible scanpaths focused on meaningful regions, in comparison with other techniques. Please see text for details, and the supplementary material for a larger set of results, also including ground truth scanpaths.
  • Figure 5: Qualitative ablation results. From top to bottom: basic GAN strategy (baseline); adding MSE to the loss function of the former; our approach; and an example ground truth scanpath. These results illustrate the need for our DTW$_{sph}$ loss term.
  • ...and 51 more figures