Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images
Virmarie Maquiling, Sean Anthony Byrne, Diederick C. Niehorster, Marco Carminati, Enkelejda Kasneci
TL;DR
The paper assesses the zero-shot pupil segmentation capabilities of SAM 2 for video-based eye tracking, addressing the annotation bottleneck by requiring only a single prompt per video. Evaluated on over 14 million pupil images from VR and wearable eye-tracking datasets, SAM 2 achieves competitive mean IoU scores (up to 93%) without any fine-tuning, while dramatically reducing manual annotation effort. The study demonstrates strong cross-dataset generalization, discusses practical lessons and open challenges, and releases code and masks to foster reproducibility. This work highlights the potential of vision foundation models to democratize large-scale data collection and standardization for gaze estimation, enabling scalable, accessible eye-tracking research across diverse platforms.
Abstract
We explore the transformative potential of SAM 2, a vision foundation model, in advancing gaze estimation and eye tracking technologies. By significantly reducing annotation time, lowering technical barriers through its ease of deployment, and enhancing segmentation accuracy, SAM 2 addresses critical challenges faced by researchers and practitioners. Using its zero-shot segmentation capabilities with minimal user input (a single click per video), we tested SAM 2 on over 14 million eye images from diverse datasets, including virtual reality setups and the world's largest unified dataset recorded using wearable eye trackers. Remarkably, in pupil segmentation tasks, SAM 2 matches the performance of domain-specific models trained solely on eye images, achieving competitive mean Intersection over Union (mIoU) scores of up to 93% without fine-tuning. Additionally, we provide our code and segmentation masks for these widely used datasets to promote further research.
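To make the reported metric concrete, the following is a minimal sketch of how a mean Intersection over Union score could be computed for binary pupil masks. It is not the authors' released evaluation code. The names `iou` and `mean_iou` are hypothetical, averaging per frame is an assumption, and so is the convention of scoring a frame as 1.0 when both masks are empty (for example, during a blink).

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (1 = pupil pixel).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over Union of two binary masks with the same shape."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks (no pupil visible) count as a perfect match.
    if union == 0:
        return 1.0
    return intersection / union


def mean_iou(preds: Sequence[Mask], truths: Sequence[Mask]) -> float:
    """Average the per-frame IoU over a sequence of frames (assumed convention)."""
    scores = [iou(p, t) for p, t in zip(preds, truths)]
    if not scores:
        raise ValueError("at least one pair of masks is required")
    return sum(scores) / len(scores)


# Toy example: the prediction marks one extra pixel as pupil.
predicted = [[0, 1, 1],
             [0, 1, 1]]
reference = [[0, 1, 0],
             [0, 1, 1]]
print(iou(predicted, reference))  # 3 shared pixels / 4 pixels in the union = 0.75
```

In practice the masks would be image-sized arrays and the loops would typically be replaced by vectorized array operations; the nested-list version is used here only to keep the sketch dependency-free.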
