SalFormer360: a transformer-based saliency estimation model for 360-degree videos

Mahmoud Z. A. Wahba; Francesco Barbato; Sara Baldoni; Federica Battisti

TL;DR

SalFormer360 tackles saliency estimation in 360-degree videos with a transformer-based approach built on a SegFormer encoder, a custom decoder, and a Viewing Center Bias, predicting saliency maps from two-frame inputs. The model processes the current frame and a past frame ($t$ and $t-5$) and applies a time-adaptive fusion between an initial saliency map and a dataset-specific center bias, controlled by $w_t = (1-\beta_i)\,\delta(t) + \beta_i$ with $\delta(t) = e^{-\alpha_i (t/C)^2}$ and $C = 600$. Training uses a composite loss comprising $\mathcal{L}_{CC}$, $\mathcal{L}_{KL}$, $\mathcal{L}_{SMSE}$ with spherical weighting $\Psi(\theta,\phi)$, and $\mathcal{L}_{BCE}$, which aligns predictions with ground-truth saliency while respecting spherical geometry. Empirically, SalFormer360 achieves CC improvements of up to 18.6% on VR-EyeTracking and maintains competitive performance on Sport360 and PVS-HM, while remaining lightweight (3.70M parameters) and capable of real-time inference (~196 fps), making it suitable for on-device viewport prediction in VR/AR systems.
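The time-adaptive weight above can be illustrated with a minimal Python sketch. It evaluates $\delta(t)$ and $w_t$ exactly as written, with $C = 600$. Several things here are assumptions rather than statements from the paper: $t$ is treated as a frame index, the values passed for $\alpha_i$ and $\beta_i$ are placeholders, and the `fuse` helper is hypothetical. In particular, the summary does not say which of the two maps $w_t$ multiplies; the sketch assumes it weights the center bias, so the bias dominates at $t = 0$ and decays toward a floor of $\beta_i$.

```python
import math

C = 600.0  # normalisation constant stated in the TL;DR


def decay(t, alpha, c=C):
    """delta(t) = exp(-alpha * (t / c)^2)."""
    return math.exp(-alpha * (t / c) ** 2)


def fusion_weight(t, alpha, beta, c=C):
    """w_t = (1 - beta) * delta(t) + beta."""
    return (1.0 - beta) * decay(t, alpha, c) + beta


def fuse(initial_map, center_bias, t, alpha, beta):
    """Hypothetical fusion of two equally sized 2D maps (lists of rows).

    ASSUMPTION: w_t weights the center bias and (1 - w_t) weights the
    initial saliency map. The source only says the fusion is
    "controlled by" w_t, so the actual combination may differ.
    """
    w = fusion_weight(t, alpha, beta)
    return [
        [w * cb + (1.0 - w) * s for s, cb in zip(row_s, row_cb)]
        for row_s, row_cb in zip(initial_map, center_bias)
    ]


if __name__ == "__main__":
    # Placeholder values for alpha_i and beta_i, chosen only for illustration.
    for t in (0, 150, 300, 600, 1200):
        print(t, round(fusion_weight(t, alpha=2.0, beta=0.25), 4))
```

Whatever the exact fusion rule, the formula itself guarantees that $w_t$ starts at 1 for $t = 0$ and decreases monotonically toward $\beta_i$ as $t$ grows, provided $\alpha_i > 0$ and $0 \le \beta_i \le 1$.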

Abstract

Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach is based on the combination of an existing encoder architecture, SegFormer, and a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks, and it has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy in our model, we incorporated Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to previous state-of-the-art.
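For readers unfamiliar with the Pearson Correlation Coefficient (CC) used to report these gains, the following is a minimal sketch of how CC between a predicted and a ground-truth saliency map can be computed. It is a plain, unweighted implementation over flattened maps; the function name `pearson_cc` is hypothetical, and the paper's evaluation pipeline may apply additional preprocessing (for example normalisation or resizing) that is not described in this summary.

```python
import math


def pearson_cc(pred, gt):
    """Pearson correlation between two 2D maps given as lists of rows."""
    p = [v for row in pred for v in row]
    g = [v for row in gt for v in row]
    if len(p) != len(g):
        raise ValueError("maps must contain the same number of pixels")

    mean_p = sum(p) / len(p)
    mean_g = sum(g) / len(g)

    # Covariance and variances, up to a common 1/N factor that cancels out.
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_g = sum((b - mean_g) ** 2 for b in g)

    denom = math.sqrt(var_p * var_g)
    # A constant map has zero variance; return 0.0 instead of dividing by zero.
    return cov / denom if denom > 0 else 0.0


if __name__ == "__main__":
    pred = [[0.1, 0.4], [0.7, 0.2]]
    gt = [[0.0, 0.5], [0.9, 0.1]]
    print(pearson_cc(pred, gt))
```

CC ranges from -1 to 1, with 1 indicating a perfect linear relationship between prediction and ground truth.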

Paper Structure (23 sections, 4 equations, 7 figures, 5 tables)

Introduction
Related work
SalFormer360 Model
Network Structure
Encoder
Decoder
Viewing Center Bias
Loss Function
Experimental results
Dataset
Experimental Setup
Baseline methods
Metrics
Performance Evaluation
Quantitative results
...and 8 more sections

Figures (7)

Figure 1: SalFormer360 can estimate future salient points in 360-degree videos using only a single previous frame and with limited computational resources.
Figure 2: Segmentation results obtained by feeding 360-degree equirectangular frames into the SegFormer-B0 model.
Figure 3: Overview of the proposed 360-degree saliency estimation framework.
Figure 4: Viewing Center Bias (CB); refer to the Viewing Center Bias section for details. (a) PVS-HM dataset, (b) Sport360 dataset, (c) VR-EyeTracking dataset.
Figure 5: Qualitative results. First row: Sport360; second row: PVS-HM; third row: VR-EyeTracking. Left: RGB input; right: ground truth (grayscale) with an overlay of the estimated saliency map (parula colormap).
...and 2 more figures
