Learning Gaussian Representation for Eye Fixation Prediction

Peipei Song; Jing Zhang; Piotr Koniusz; Nick Barnes

TL;DR

This work tackles the stochastic nature of human eye fixations by modeling fixation maps as probability distributions using a Gaussian Mixture Model (GMM) rather than dense per-pixel maps. It introduces SalGMM, an end-to-end network that predicts GMM parameters through a three-part architecture (Feature Net, Parameter Transformation, Reconstruction Loss) and an anchor-based coordinate regression scheme, enabling real-time inference with lightweight backbones. Experiments on SALICON, MIT1003, and TORONTO demonstrate competitive accuracy across standard saliency metrics while achieving significant speedups and smaller model sizes, suitable for edge devices. By learning in the GMM parameter space, the approach offers robustness to fixation variability and a compact representation that preserves key attention patterns across images.

Abstract

Existing eye fixation prediction methods map input images to the corresponding dense fixation maps generated from raw fixation points. However, due to the stochastic nature of human fixation, the generated dense fixation maps may be a less-than-ideal representation of human fixation. To provide a robust fixation model, we introduce a Gaussian representation for eye fixation modeling. Specifically, we propose to model the eye fixation map as a mixture of probability distributions, namely a Gaussian Mixture Model. In this new representation, we use several Gaussian components as an alternative to the provided fixation map, which makes the model more robust to the randomness of fixation. In addition, we build our framework on lightweight backbones to achieve real-time fixation prediction. Experimental results on three public fixation prediction datasets (SALICON, MIT1003, TORONTO) demonstrate that our method is fast and effective.
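
To make the representation concrete, the following minimal sketch renders a dense map from a small set of Gaussian components. It is an illustration only, not the SalGMM implementation: the function names `render_gmm_map` and `normalize`, the isotropic (single standard deviation) components, the 32x32 grid, and all parameter values are assumptions chosen for the example.

```python
import math


def render_gmm_map(weights, means, sigmas, height, width):
    """Render an isotropic 2D Gaussian mixture as a dense height x width map.

    weights: mixing coefficients (assumed non-negative and summing to 1)
    means:   (x, y) centres in pixel coordinates
    sigmas:  per-component standard deviations in pixels
             (isotropic components are an assumption of this sketch)
    """
    grid = [[0.0] * width for _ in range(height)]
    for w, (mx, my), s in zip(weights, means, sigmas):
        # Weighted normalisation constant of a 2D isotropic Gaussian.
        norm = w / (2.0 * math.pi * s * s)
        for y in range(height):
            for x in range(width):
                d2 = (x - mx) ** 2 + (y - my) ** 2
                grid[y][x] += norm * math.exp(-d2 / (2.0 * s * s))
    return grid


def normalize(grid):
    """Rescale the map so that its entries sum to 1."""
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]


# Hypothetical parameters: two components on a 32x32 grid.
weights = [0.7, 0.3]
means = [(8.0, 8.0), (24.0, 20.0)]
sigmas = [2.0, 3.0]

dense = normalize(render_gmm_map(weights, means, sigmas, 32, 32))
```

In this toy setting the mixture is described by 8 numbers (one weight, two centre coordinates, and one standard deviation per component), whereas the dense map it reproduces has 1024 entries, which illustrates why the parameter space is a compact target.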

Paper Structure (13 sections, 5 equations, 7 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Method
Fixation dataset analysis
Our proposed network: SalGMM
Feature Net
Parameter Transformation
Reconstruction Loss
Experimental Results
Experimental Setup
Performance evaluation
Ablation Study
Conclusions

Figures (7)

Figure 1: Comparison of state-of-the-art methods w.r.t. the number of model parameters, inference speed and the performance (correlation coefficient). The size of the circles indicates the number of model parameters. "SalGMM-*" are several of our models where "*" should be replaced by specific names of backbone networks.
Figure 2: Dense fixation maps from repeated random selections of participants. The parameter "r=0.7" indicates that we randomly sample 70% of the participants to generate the Gaussian-blurred eye fixation map.
Figure 3: Visualization of the image, annotations, and GMM fitted fixation maps. Fixation map (c) is the Gaussian blurred eye fixation map. Figures (d)-(h) show the reconstructed fixation map given a different number of Gaussian components $C$. The solid circle and radius of the outer circle are the mean and standard deviation of the fitted Gaussian distribution.
Figure 4: Network architecture. Note that centers of Gaussians are expressed w.r.t. the spatial reference grid.
Figure 5: Three kinds of anchor settings (reference points).
...and 2 more figures
