Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images

Virmarie Maquiling, Sean Anthony Byrne, Diederick C. Niehorster, Marco Carminati, Enkelejda Kasneci

TL;DR

The paper assesses the zero-shot pupil segmentation capabilities of SAM 2 for video-based eye tracking, addressing the annotation bottleneck by requiring only a single prompt per video. Using over 14 million eye images across VR and mobile datasets, SAM 2 achieves competitive mean IoU scores (up to around 93%) without any fine-tuning, while dramatically reducing manual annotation effort. The study demonstrates strong cross-dataset generalization, discusses practical lessons and open challenges, and releases code and masks to foster reproducibility. This work highlights the potential of vision foundation models to democratize large-scale gaze-estimation data collection and standardization, enabling scalable, accessible eye-tracking research across diverse platforms.

Abstract

We explore the transformative potential of SAM 2, a vision foundation model, in advancing gaze estimation and eye tracking technologies. By significantly reducing annotation time, lowering technical barriers through its ease of deployment, and enhancing segmentation accuracy, SAM 2 addresses critical challenges faced by researchers and practitioners. Utilizing its zero-shot segmentation capabilities with minimal user input (a single click per video), we tested SAM 2 on over 14 million eye images from diverse datasets, including virtual reality setups and the world's largest unified dataset recorded using wearable eye trackers. Remarkably, in pupil segmentation tasks, SAM 2 matches the performance of domain-specific models trained solely on eye images, achieving competitive mean Intersection over Union (mIoU) scores of up to 93% without fine-tuning. Additionally, we provide our code and segmentation masks for these widely used datasets to promote further research.
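
For readers unfamiliar with the metric, the following is a minimal sketch of how a mean IoU score over binary pupil masks could be computed. The function names `iou` and `mean_iou` are illustrative, and both averaging per image and scoring two empty masks as 1.0 are assumptions made here; the paper's exact evaluation protocol may differ.

```python
import numpy as np


def iou(pred, target):
    """Intersection over Union between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks (no pupil visible) count as agreement.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def mean_iou(preds, targets):
    """Average the per-image IoU over paired predicted and reference masks."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))


# Toy example: a 2x2 predicted pupil against a 2x3 reference pupil.
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True
ref = np.zeros((4, 4), dtype=bool)
ref[1:3, 1:4] = True

print(iou(pred, ref))                        # intersection 4, union 6
print(mean_iou([pred, ref], [ref, ref]))     # one partial and one perfect match
```

A score of 93% therefore means that, on average, the overlap between predicted and reference pupil regions covers 93% of their combined area.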

Paper Structure

This paper contains 11 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: An illustration demonstrating the data annotation process with SAM 2: the user provides a single point prompt via a mouse click, and SAM 2 automatically handles the rest of the segmentation process. Optionally, the user can refine and add additional prompts in more difficult areas of the video to improve the model's output.
  • Figure 2: Comparison between Segment Anything Model 2 (top), a traditional segmentation model trained specifically on eye tracking datasets (left), and Segment Anything Model (right). The sample eye images are taken from the GW dataset [kothari2020gaze].
  • Figure 3: SAM 2 results on various VR (first and second rows) and mobile eye tracking (third and last rows) datasets. Images are taken from the OpenEDS2019 [garbin2019openeds], OpenEDS2020 [palmero2020openeds2020], LPW [tonsen2016labelled], and Dikablis [fuhl2021teyed] datasets. SAM 2 handled occlusions remarkably well, as observed in rows 1 and 4, and effectively segmented the pupil across a wide range of datasets, showing its robustness to different eye tracking conditions.
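
The single-click workflow illustrated in Figure 1 can be sketched roughly as follows. This is not the authors' released code: the helper `segment_video_with_single_click` and its parameters are hypothetical, and it assumes a video predictor object that exposes `init_state`, `add_new_points_or_box`, and `propagate_in_video` as in the public SAM 2 repository (early releases named the prompt method `add_new_points`). Thresholding the mask logits at 0 follows the repository's example notebooks.

```python
import numpy as np


def _to_numpy(x):
    """Convert a torch tensor (or any array-like) to a NumPy array."""
    if hasattr(x, "detach"):
        x = x.detach().cpu().numpy()
    return np.asarray(x)


def segment_video_with_single_click(predictor, video_dir, click_xy,
                                    frame_idx=0, obj_id=1):
    """Prompt one frame with one positive click, then propagate the mask.

    Returns a dict mapping frame index -> boolean pupil mask (H, W).
    """
    state = predictor.init_state(video_path=video_dir)

    # One (x, y) point labeled 1, i.e. a positive ("this is the pupil") click.
    points = np.array([click_xy], dtype=np.float32)
    labels = np.array([1], dtype=np.int32)
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=frame_idx,
        obj_id=obj_id,
        points=points,
        labels=labels,
    )

    masks = {}
    # The predictor yields per-frame logits for every tracked object.
    for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(state):
        for i, out_id in enumerate(out_obj_ids):
            if out_id == obj_id:
                masks[out_frame_idx] = _to_numpy(out_mask_logits[i] > 0.0).squeeze()
    return masks
```

With the SAM 2 package installed, a predictor would typically be created with `build_sam2_video_predictor` from `sam2.build_sam`, given a model config and checkpoint of your choice, and passed in together with a directory of extracted video frames and the pixel coordinates of the click. Additional refinement clicks on difficult frames, as mentioned in the Figure 1 caption, would correspond to further prompt calls before propagation.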