Learning User Embeddings from Human Gaze for Personalised Saliency Prediction

Florian Strohm, Mihai Bâce, Andreas Bulling

TL;DR

A novel method that extracts user embeddings from pairs of natural images and corresponding saliency maps, which are generated from a small amount of user-specific eye tracking data. A Siamese convolutional neural encoder learns the user embeddings by contrasting the image and personal saliency map pairs of different users.

Abstract

Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two public saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model's ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.

Paper Structure (22 sections, 3 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Architecture of our proposed user embedding extractor. The input consists of multiple images, each with an additional channel containing the saliency information of the specific user for whom we aim to extract an embedding. A Siamese convolutional neural network extracts features from each image-saliency map pair. The extracted features are then averaged and normalised, resulting in the user embedding.
  • Figure 2: The personalised saliency map (PSM) network takes an image stimulus and its corresponding universal saliency map (USM) as input. It additionally takes the user embedding, which is used to predict kernel weights that are then convolved over the image-USM features. The network outputs a discrepancy map, which is added to the USM to generate the PSM.
  • Figure 3: Example PSM predictions for two users from the Personal Saliency (PS) test set (Xu et al., 2018), comparing our proposed method with the ground truths.
  • Figure 4: Performance comparison of different user embedding extraction models. The y-axis indicates the model's accuracy and the x-axis how many image-PSM examples $m$ were used as input to extract the embedding (4, 8, 16, 32, 48 or 64). We report the accuracy for unseen participants on the Individual Differences (ID) dataset, the Personal Saliency (PS) dataset and for the combined CD dataset.
  • Figure 5: We use t-SNE to reduce the 32-dimensional embeddings to two dimensions. Panel (a) shows the embedding space for embeddings extracted using $m=8$ image-PSM pairs, while panel (b) uses $m=32$ pairs. Each dot represents one embedding extracted using $m$ randomly sampled image-PSM pairs. Each colour corresponds to a unique user from the ID test set.
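
The captions for Figures 1 and 2 describe two simple arithmetic steps around the learned networks: per-pair features are averaged and normalised into a user embedding, and the predicted discrepancy map is added to the USM to obtain the PSM. The following is a minimal sketch of those two steps only, not the authors' implementation. The function names `aggregate_user_embedding` and `compose_psm` are hypothetical, the use of L2 normalisation is an assumption (the caption says only "normalised"), and the feature vectors and maps are plain Python lists standing in for the outputs of the Siamese encoder and the PSM network.

```python
import math
from typing import List, Sequence


def aggregate_user_embedding(pair_features: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-pair feature vectors, then normalise the mean.

    Each inner sequence stands in for the encoder output of one
    image-saliency map pair (hypothetical placeholder data).
    """
    if not pair_features:
        raise ValueError("need at least one image-saliency feature vector")
    dim = len(pair_features[0])
    count = len(pair_features)
    mean = [sum(f[i] for f in pair_features) / count for i in range(dim)]
    # Assumption: L2 normalisation; the caption only states "normalised".
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean] if norm > 0 else mean


def compose_psm(usm: Sequence[Sequence[float]],
                discrepancy: Sequence[Sequence[float]]) -> List[List[float]]:
    """Add the discrepancy map to the USM element-wise to obtain the PSM."""
    return [[u + d for u, d in zip(u_row, d_row)]
            for u_row, d_row in zip(usm, discrepancy)]


if __name__ == "__main__":
    # Two toy 2-dimensional feature vectors for one user.
    embedding = aggregate_user_embedding([[3.0, 0.0], [3.0, 8.0]])
    print(embedding)
    # A toy 1x2 USM and discrepancy map.
    print(compose_psm([[0.2, 0.5]], [[0.1, -0.5]]))
```

Averaging before normalising keeps the embedding independent of how many image-PSM pairs $m$ were supplied, which is consistent with Figure 4 varying $m$ at extraction time.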