Learning Gaussian Representation for Eye Fixation Prediction
Peipei Song, Jing Zhang, Piotr Koniusz, Nick Barnes
TL;DR
This work tackles the stochastic nature of human eye fixations by modeling fixation maps as probability distributions, specifically a Gaussian Mixture Model (GMM), rather than as dense per-pixel maps. It introduces SalGMM, an end-to-end network that predicts GMM parameters using three components (Feature Net, Parameter Transformation, and Reconstruction Loss) together with an anchor-based coordinate regression scheme, enabling real-time inference with lightweight backbones. Experiments on SALICON, MIT1003, and TORONTO show competitive accuracy across standard saliency metrics along with faster inference and smaller model sizes, making the approach suitable for edge devices. Because the network learns in the GMM parameter space, it is more robust to fixation variability, and the compact representation preserves the key attention patterns in each image.
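The TL;DR names a Parameter Transformation step and anchor-based coordinate regression but does not give formulas here. As a minimal sketch, assuming a mixture-density-style design (softmax mixture weights, exponentiated standard deviations, a tanh-bounded correlation, and means regressed as bounded offsets from fixed anchors), such a transformation could look like the following. The raw output layout, the uniform anchor grid, and the names `make_anchors`, `transform_params`, `offset_scale`, and `base_sigma` are all hypothetical and are not taken from SalGMM.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def make_anchors(grid):
    """Hypothetical anchor layout: a uniform grid x grid of centers in [0, 1]^2."""
    c = (np.arange(grid) + 0.5) / grid
    xs, ys = np.meshgrid(c, c)
    return np.stack([xs.ravel(), ys.ravel()], axis=1)  # shape (grid * grid, 2)


def transform_params(raw, anchors, offset_scale=0.1, base_sigma=0.05):
    """Map unconstrained outputs to valid 2-D GMM parameters, one component per anchor.

    raw has shape (K, 6) with the hypothetical per-component layout
    [weight_logit, dx, dy, log_sx, log_sy, rho_raw].
    """
    raw = np.asarray(raw, dtype=float)
    anchors = np.asarray(anchors, dtype=float)
    weights = softmax(raw[:, 0])                           # non-negative, sums to 1
    means = anchors + offset_scale * np.tanh(raw[:, 1:3])  # mean stays within offset_scale of its anchor
    sigmas = base_sigma * np.exp(raw[:, 3:5])              # strictly positive standard deviations
    rho = 0.99 * np.tanh(raw[:, 5])                        # correlation kept inside (-1, 1)
    return weights, means, sigmas, rho


anchors = make_anchors(4)
weights, means, sigmas, rho = transform_params(np.zeros((16, 6)), anchors)
```

With all-zero raw outputs on a 4 × 4 anchor grid, every component gets weight 1/16, sits exactly on its anchor, and has standard deviation `base_sigma`. Bounding the offset with tanh is one way to keep each component tied to its anchor so that the regression targets stay local; the paper's actual parameterization may differ.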
Abstract
Existing eye fixation prediction methods learn a mapping from input images to the corresponding dense fixation maps generated from raw fixation points. However, because human fixation is stochastic, these dense fixation maps may be a less-than-ideal representation of human fixation. To obtain a more robust fixation model, we introduce a Gaussian representation for eye fixation modeling. Specifically, we model the eye fixation map as a mixture of probability distributions, namely a Gaussian Mixture Model. In this representation, several Gaussian components replace the provided fixation map, which makes the model more robust to the randomness of fixations. We also build our framework on lightweight backbones to achieve real-time fixation prediction. Experimental results on three public fixation prediction datasets (SALICON, MIT1003, and TORONTO) demonstrate that our method is fast and effective.
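To make the representation concrete: a GMM can be evaluated on the pixel grid to recover a dense fixation map, which is what allows a few Gaussian components to stand in for a full map. The sketch below assumes normalized [0, 1] image coordinates and full 2-D covariances parameterized by two standard deviations and a correlation coefficient; the function names are hypothetical. It also compares a predicted map with a ground-truth map using KL divergence, a standard saliency metric, included here only as an assumed example of a reconstruction-style loss rather than SalGMM's actual loss.

```python
import numpy as np


def render_gmm(weights, means, sigmas, rho, height, width):
    """Evaluate a 2-D GMM at pixel centers and normalize it into a probability map."""
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    gx, gy = np.meshgrid(xs, ys)  # both of shape (height, width)
    density = np.zeros((height, width))
    for w, (mx, my), (sx, sy), r in zip(weights, means, sigmas, rho):
        dx = (gx - mx) / sx
        dy = (gy - my) / sy
        # Bivariate normal density with correlation r.
        z = dx ** 2 - 2.0 * r * dx * dy + dy ** 2
        norm = 2.0 * np.pi * sx * sy * np.sqrt(1.0 - r ** 2)
        density += w * np.exp(-z / (2.0 * (1.0 - r ** 2))) / norm
    return density / density.sum()


def kl_divergence(gt, pred, eps=1e-7):
    """KL(gt || pred) between two maps, each normalized to sum to 1 first."""
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    gt = gt / gt.sum()
    pred = pred / pred.sum()
    return float(np.sum(gt * np.log((gt + eps) / (pred + eps))))


# One centered component and a uniform "prediction" on a 21 x 21 grid.
fixation_map = render_gmm([1.0], [(0.5, 0.5)], [(0.1, 0.1)], [0.0], 21, 21)
uniform_map = np.ones((21, 21))
```

A map of any resolution can be rendered from the same few parameters, so the compact GMM, not the pixel grid, is what the network needs to predict.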
