SalFormer360: a transformer-based saliency estimation model for 360-degree videos
Mahmoud Z. A. Wahba, Francesco Barbato, Sara Baldoni, Federica Battisti
TL;DR
SalFormer360 estimates saliency in 360-degree videos with a transformer-based model that combines a SegFormer encoder, a custom decoder, and a Viewing Center Bias, predicting saliency maps from two-frame inputs. The model processes the current frame and a past frame ($t$ and $t-5$) and applies a time-adaptive fusion between an initial saliency map and a dataset-specific center bias, controlled by $w_t = (1-\beta_i)\,\delta(t) + \beta_i$ with $\delta(t) = e^{-\alpha_i (t/C)^2}$ and $C = 600$. Training uses a composite loss comprising $\mathcal{L}_{CC}$, $\mathcal{L}_{KL}$, $\mathcal{L}_{SMSE}$ with spherical weighting $\Psi(\theta,\phi)$, and $\mathcal{L}_{BCE}$, which encourages alignment with ground-truth saliency while respecting spherical geometry. Empirically, SalFormer360 improves CC by up to 18.6% on VR-EyeTracking and maintains competitive performance on Sport360 and PVS-HM, while remaining lightweight (3.70M parameters) and capable of real-time inference (~196 fps), making it suitable for on-device viewport prediction in VR/AR systems.
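The time-adaptive weight can be illustrated with a minimal Python sketch. It follows the two formulas given above, with $C = 600$. Everything else is an assumption made for illustration: the function names, the example values of $\alpha_i$ and $\beta_i$, and in particular the `fuse` step. The summary does not state whether $w_t$ multiplies the center bias or the initial saliency map, so the convex blend below (with $w_t$ on the center bias) is only one plausible reading, not the authors' confirmed formulation.

```python
import math

C = 600.0  # normalization constant stated in the summary


def delta(t, alpha):
    """Gaussian-like decay: delta(t) = exp(-alpha * (t / C)^2)."""
    return math.exp(-alpha * (t / C) ** 2)


def fusion_weight(t, alpha, beta):
    """w_t = (1 - beta) * delta(t) + beta; equals 1 at t = 0 and tends to beta as t grows."""
    return (1.0 - beta) * delta(t, alpha) + beta


def fuse(initial_saliency, center_bias, w):
    """Hypothetical fusion step (assumption): convex blend with w on the center bias.

    Both inputs are 2D lists of the same shape.
    """
    return [
        [w * b + (1.0 - w) * s for s, b in zip(s_row, b_row)]
        for s_row, b_row in zip(initial_saliency, center_bias)
    ]


if __name__ == "__main__":
    alpha, beta = 1.0, 0.25  # illustrative values, not taken from the paper
    for t in (0, 300, 600, 1200):
        print(t, round(fusion_weight(t, alpha, beta), 4))
```

Under these formulas the weight starts at 1 for $t = 0$ and decays smoothly toward $\beta_i$, so $\beta_i$ acts as a floor on $w_t$ and $\alpha_i$ sets how quickly that floor is approached.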
Abstract
Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach is based on the combination of an existing encoder architecture, SegFormer, and a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks, and it has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy in our model, we incorporated Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to previous state-of-the-art.
