Pathformer3D: A 3D Scanpath Transformer for 360° Images

Rong Quan, Yantao Lai, Mengyu Qiu, Dong Liang

TL;DR

This work tackles the challenge of predicting human gaze paths in 360° images, where 2D projections introduce distortion and coordinate discontinuities. It introduces Pathformer3D, a 3D spherical-coordinate framework that employs a SphereNetS-CNN for 3D feature extraction, a 3D Transformer encoder for contextualization, a Transformer decoder with visual working-memory-inspired attention, and a 3D Mixture Density Network to sample fixations from 3D Gaussian mixtures. The approach achieves state-of-the-art results on four panoramic eye-tracking datasets, closely matching human performance across multiple metrics and providing competitive efficiency given its autoregressive sampling. By operating in 3D space and explicitly modeling temporal dependencies and multimodal fixation distributions, Pathformer3D offers a robust, distortion-free mechanism for 360° gaze prediction with practical implications for rapid rendering and interactive VR/AR experiences.

Abstract

Scanpath prediction in 360° images can help realize rapid rendering and better user interaction in Virtual/Augmented Reality applications. However, existing scanpath prediction models for 360° images operate on the 2D equirectangular projection plane, which introduces large computational errors owing to the plane's distortion and coordinate discontinuity. In this work, we perform scanpath prediction for 360° images in a 3D spherical coordinate system and propose a novel 3D scanpath Transformer named Pathformer3D. Specifically, a 3D Transformer encoder is first used to extract a 3D contextual feature representation of the 360° image. Then, the contextual feature representation and historical fixation information are fed into a Transformer decoder to output the fixation embedding for the current time step, where the self-attention module imitates the visual working memory mechanism of the human visual system and directly models the temporal dependencies among fixations. Finally, a 3D Gaussian distribution is learned from each fixation embedding, from which the fixation position can be sampled. Evaluation on four panoramic eye-tracking datasets demonstrates that Pathformer3D outperforms the current state-of-the-art methods. Code is available at https://github.com/lsztzp/Pathformer3D.
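The coordinate discontinuity mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper's code: the function names, the axis convention, and the 2048×1024 image size are assumptions chosen for illustration. Two gaze points on opposite sides of the equirectangular seam are far apart in pixel space yet almost identical on the sphere.

```python
import math

def equirect_to_unit_vector(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a point on the unit sphere.

    Assumed convention (not from the paper): u spans longitude [-pi, pi),
    v spans latitude from +pi/2 (top row) to -pi/2 (bottom row).
    """
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)

def great_circle_angle(p, q):
    """Angle in radians between two unit vectors, clamped for numerical safety."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))

# Hypothetical image size and two fixations on the equator,
# one just right of the left edge, one just left of the right edge.
width, height = 2048, 1024
left = (2.0, 512.0)
right = (2046.0, 512.0)

# Distance measured naively on the 2D projection plane (in pixels).
pixel_gap = math.hypot(right[0] - left[0], right[1] - left[1])

# Distance measured on the sphere (in radians).
p = equirect_to_unit_vector(left[0], left[1], width, height)
q = equirect_to_unit_vector(right[0], right[1], width, height)
angle = great_circle_angle(p, q)

print(f"pixel gap: {pixel_gap:.0f} px, spherical gap: {math.degrees(angle):.2f} deg")
```

On the 2D plane the two points appear to be almost a full image width apart, while on the sphere they are separated by well under one degree. A model that regresses 2D coordinates has to learn around this seam; a model that works with 3D coordinates does not encounter it.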

Paper Structure

This paper contains 32 sections, 8 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Overall architecture of our method. The Transformer Encoder contextualizes the features extracted by the SphereNetS-CNN. The contextual feature representation and the historical fixations are fed into the Transformer Decoder to output fixation embeddings. Each fixation embedding is input into a 3D Mixture Density Network to output its 3D Gaussian distribution, from which the fixation is sampled.
  • Figure 2: Qualitative results of different models on four datasets. From top to bottom, the four images are sourced from Sitzmann, Salient360!, AOI, and JUFE, respectively. From left to right, we display a scanpath sampled from the ground truth, our model, ScanDMM, ScanGAN, and SaltiNet.
  • Figure 3: Qualitative comparison of saliency maps. The first two images are from Salient360! and the next three images are from JUFE. From top to bottom: the real image, the saliency map of the real image, and the saliency maps of our model, ScanDMM, and ScanGAN.
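
The final stage in the Figure 1 caption, sampling a fixation from a 3D Gaussian distribution produced by the Mixture Density Network, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The mixture parameters are made-up values, the covariance is assumed to be diagonal, and projecting the sample back onto the unit sphere is an assumption added here so that the result is a valid viewing direction.

```python
import math
import random

def sample_fixation(weights, means, sigmas, rng):
    """Draw one 3D point from a diagonal-covariance Gaussian mixture.

    weights: mixing coefficients, one per component
    means:   list of (x, y, z) component centres
    sigmas:  list of (sx, sy, sz) per-axis standard deviations
    rng:     a random.Random instance, passed in for reproducibility
    """
    # Step 1: pick a mixture component according to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Step 2: sample each coordinate from that component's Gaussian.
    point = [rng.gauss(m, s) for m, s in zip(means[k], sigmas[k])]
    # Step 3 (assumption): normalise so the fixation lies on the unit sphere.
    norm = math.sqrt(sum(c * c for c in point))
    return tuple(c / norm for c in point)

# Hypothetical mixture for a single time step: two candidate gaze targets.
weights = [0.7, 0.3]
means = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
sigmas = [(0.05, 0.05, 0.05), (0.10, 0.10, 0.10)]

rng = random.Random(0)
fixation = sample_fixation(weights, means, sigmas, rng)
print(fixation)
```

In an autoregressive setting, each sampled fixation would be appended to the fixation history before the decoder produces the next embedding. A mixture, as opposed to a single Gaussian, lets one time step place probability mass on several distinct regions of the sphere.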