Few-shot Personalized Scanpath Prediction

Ruoyu Xue, Jingyi Xu, Sounak Mondal, Hieu Le, Gregory Zelinsky, Minh Hoai, Dimitris Samaras

TL;DR

This work tackles few-shot personalized scanpath prediction by decoupling subject representation learning from the scanpath predictor. It introduces SE-Net to extract robust subject embeddings from limited gaze data and uses ISP-SENet, a subject-conditioned predictor, to generate personalized scanpaths without test-time fine-tuning. Training SE-Net with triplet and contrastive losses on a base dataset enables effective generalization to unseen subjects when provided with a small support set, achieving strong performance on OSIE, COCO-FreeView, and COCO-Search18 across 1-, 5-, and 10-shot scenarios. The approach adapts rapidly (on the order of a few seconds), improves accuracy, and makes it possible to interpret which fixations drive subject differentiation, supporting practical deployment in eye-tracking applications.

Abstract

A personalized model for scanpath prediction provides insights into the visual preferences and attention patterns of individual subjects. However, existing methods for training scanpath prediction models are data-intensive and cannot be effectively personalized to new individuals with only a few available examples. In this paper, we propose the few-shot personalized scanpath prediction (FS-PSP) task, which aims to predict scanpaths for an unseen subject using minimal support data of that subject's scanpath behavior, and a novel method to address it. The key to our method's adaptability is the Subject-Embedding Network (SE-Net), specifically designed to capture unique, individualized representations for each subject's scanpaths. SE-Net generates subject embeddings that effectively distinguish between subjects while minimizing variability among scanpaths from the same individual. The personalized scanpath prediction model is then conditioned on these subject embeddings to produce accurate, personalized results. Experiments on multiple eye-tracking datasets demonstrate that our method excels in FS-PSP settings and does not require any fine-tuning steps at test time. Code is available at: https://github.com/cvlab-stonybrook/few-shot-scanpath
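
To make the embedding objective concrete, here is a minimal sketch of a standard triplet margin loss, the general family of loss the TL;DR says SE-Net is trained with. It is not taken from the authors' code: the function names, the Euclidean distance, the margin value, and the toy 2-D embeddings are all assumptions chosen for illustration. The loss is zero when an embedding from the same subject lies closer to the anchor than an embedding from a different subject by at least the margin.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    anchor and positive are assumed to come from the same subject,
    negative from a different subject. The margin is a hypothetical value.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D subject embeddings (real embeddings would be much larger).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # same subject: distance 1 from the anchor
negative = [3.0, 4.0]  # different subject: distance 5 from the anchor

# Well-separated triplet: 1 - 5 + 1 is negative, so the loss clamps to 0.
loss_separated = triplet_loss(anchor, positive, negative)

# Swapping the roles violates the margin: 5 - 1 + 1 = 5.
loss_violated = triplet_loss(anchor, negative, positive)
```

Minimizing this quantity over many triplets pulls embeddings of the same subject together and pushes embeddings of different subjects apart, which is the behavior the abstract attributes to SE-Net.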

Paper Structure

This paper contains 32 sections, 6 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Few-shot Personalized Scanpath Prediction (FS-PSP). Given a new subject with only a few support examples of their gaze behavior, can we adapt a base scanpath prediction model to this subject? We propose a subject-embedding extraction network, SE-Net, to achieve this personalized adaptation.
  • Figure 2: Overview of ISP-SENet. Our method for few-shot personalized scanpath prediction has two stages. In the training stage, we train two models on a large number of image-scanpath pairs $\mathcal{D}_{base}$, corresponding to a set of seen subjects. We first train the Subject Embedding Network (SE-Net) to obtain embeddings for seen subjects, and then train ISP-SENet to predict scanpaths using these embeddings. In the inference stage, both models are frozen, and we extract embeddings for unseen subjects from the support set, $\mathcal{D}_{supp}$, which consists of $n$-shot images sampled from the base set. These unseen subject embeddings then guide ISP-SENet in predicting scanpaths for unseen subjects using the query set, $\mathcal{D}_{query}$, which includes a collection of unseen images.
  • Figure 3: Structure of SE-Net. SE-Net employs a feature extractor $F$ to derive image and scanpath semantic features, $F_I$ and $F_S$, respectively. The CSE module then processes task and duration features, updating the scanpath embedding constrained by all extracted features. An initialized embedding learns human attention information from $F_S'$ to produce the subject embedding $e$. This triplet network assesses the distances among $e^{d_+}, e^{d_+}, e^{d_-}$, and the UP module predicts the subject ID. All CSE modules share the same weights, as do the USD and UP modules.
  • Figure 4: Qualitative examples of scanpath prediction for different unseen subjects. GT denotes the ground-truth scanpaths of the different unseen subjects. The red circle marks the final fixation. In the third row, the subject is searching for "bowl". The results indicate that our method captures the temporal order of fixations, fixation distributions, and distractions, whereas the baseline keeps predicting similar scanpaths for different subjects.
  • Figure 5: Model interpretability. By analyzing a large dataset of seen subject-scanpath pairs, SE-Net determines the most influential fixations in shaping unseen subject embeddings. Fixations highlighted in blue represent the two with the highest weights. This analysis demonstrates that ISP-SENet can effectively identify which fixations are crucial for distinguishing between subjects.
  • ...and 5 more figures