Zero-Shot Segmentation of Eye Features Using the Segment Anything Model (SAM)

Virmarie Maquiling; Sean Anthony Byrne; Diederick C. Niehorster; Marcus Nyström; Enkelejda Kasneci

TL;DR

This work tests zero-shot segmentation of eye regions with the Segment Anything Model (SAM) on VR eye images from OpenEDS2019 and OpenEDS2020, comparing multiple prompting strategies. It reports high pupil IoU (up to 93.34%) and strong iris segmentation when prompts are used (up to 86.63%), while sclera segmentation remains challenging (up to ~62% IoU). Overall mIoU is therefore around 85.7%–83.9%, rising to ~91% when the sclera is excluded. The findings show that SAM can match expert-level annotations for pupil segmentation in a zero-shot setting and that prompts help for iris and sclera, but they also expose limitations that motivate domain-specific fine-tuning or eye-focused foundation models. The work suggests that foundation models can reduce the annotation burden and improve generalization in gaze estimation, and it points future development toward foundation models tailored to eye images or fine-tuned on eye datasets, with practical impact for democratizing eye-tracking technology. The authors provide code to reproduce and extend their experiments.

Abstract

The advent of foundation models signals a new era in artificial intelligence. The Segment Anything Model (SAM) is the first foundation model for image segmentation. In this study, we evaluate SAM's ability to segment features from eye images recorded in virtual reality setups. The increasing requirement for annotated eye-image datasets presents a significant opportunity for SAM to redefine the landscape of data annotation in gaze estimation. Our investigation centers on SAM's zero-shot learning abilities and the effectiveness of prompts like bounding boxes or point clicks. Our results are consistent with studies in other domains, demonstrating that SAM's segmentation effectiveness can be on par with specialized models depending on the feature, with prompts improving its performance, evidenced by an IoU of 93.34% for pupil segmentation in one dataset. Foundation models like SAM could revolutionize gaze estimation by enabling quick and easy image segmentation, reducing reliance on specialized models and extensive manual annotation.
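
The IoU figures quoted above compare a predicted mask against a ground-truth mask. As a minimal sketch of that metric (not the authors' evaluation code), the snippet below computes IoU for binary masks stored as plain nested lists, plus a mean over per-class values with optional exclusion of a class such as the sclera. The function names, the empty-mask convention, and the toy numbers are assumptions made for illustration; which classes the paper averages into its mIoU is not stated in this summary.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(per_class_iou, exclude=()):
    """Average per-class IoU values, optionally leaving some classes out."""
    kept = [value for name, value in per_class_iou.items() if name not in exclude]
    return sum(kept) / len(kept)


# Toy 3x4 masks: 4 ground-truth pixels, 4 predicted pixels, 3 shared.
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
score = iou(pred, truth)  # 3 shared pixels / 5 pixels in the union

# Made-up per-class scores, NOT results from the paper.
toy_scores = {"pupil": 0.9, "iris": 0.8, "sclera": 0.4}
all_classes = mean_iou(toy_scores)
without_sclera = mean_iou(toy_scores, exclude=("sclera",))
```

With real images the same logic is usually written with array operations, but the plain-list version keeps the definition visible: one weak class pulls the mean down, which is why excluding it raises the average.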

Paper Structure (16 sections, 4 equations, 6 figures, 1 table)

Introduction
RELATED WORK
An Introduction to Foundation models
The Segment Anything Model
Applications of the Segment Anything Model
Segmentation of eye-images recorded using video-based eye-tracking systems
Methodology
Gaze Datasets
Prompting Strategies
Automated/no prompting
Manual prompting
Evaluation
Results
Discussion
Limitations and Future Work
...and 1 more sections

Figures (6)

Figure 1: A high-level schematic of the Segment Anything Model (SAM), which consists of an image encoder, a prompt encoder, and a lightweight mask decoder. The image is fed to the image encoder, while a set of prompts (in this example, point and bounding-box prompts drawn on the image) is fed to the prompt encoder. The embeddings generated by the image and prompt encoders are passed to the mask decoder, which outputs valid masks with varying confidence scores. Alternatively, it may output a single mask averaged over all valid masks. Image source: Palmero et al. (2021).
Figure 2: Representative images from the OpenEDS2019 (Garbin et al., 2020; left) and OpenEDS2020 (Palmero et al., 2021; right) datasets. Both datasets were captured using a VR head-mounted display equipped with dual synchronous eye-facing cameras. The OpenEDS2019 dataset contains images with a resolution of $400\times640$, while the images from the OpenEDS2020 dataset have a resolution of $640\times400$.
Figure 3: Prompt strategies for segmenting pupil, iris, and sclera. Green points represent foreground point prompts, while red points represent background point prompts. Bounding box prompts are visualized as light blue rectangles surrounding the feature of interest.
Figure 4: Performance of each strategy on different segmentation tasks (pupil, iris, and sclera segmentation). Each row represents a different metric. On each plot, the best-performing strategy (highest Dice/IoU and lowest HD) is colored red.
Figure 5: Visualization of SAM's performance on an image from OpenEDS2019 (Garbin et al., 2020). The leftmost column shows the ground truth masks for pupil (top row), iris (middle row), and sclera (bottom row) overlaid on the input image. The remaining columns show SAM's segmentations using different strategies.
...and 1 more figures
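
The point and bounding-box prompts described in the captions for Figures 1 and 3 can be represented as simple coordinate data. The sketch below is illustrative only: it derives a box and a foreground point from an existing binary mask, which is an assumption made here for the example and not the manual prompting procedure used in the paper. The helper names are hypothetical. The label convention (1 for foreground, 0 for background) and the (x_min, y_min, x_max, y_max) box layout follow what the public `segment_anything` package's `SamPredictor.predict` accepts, but the snippet does not call that library.

```python
def foreground_pixels(mask):
    """List (x, y) coordinates of nonzero pixels in a nested-list mask."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def box_from_mask(mask):
    """Tightest (x_min, y_min, x_max, y_max) box around the foreground."""
    coords = foreground_pixels(mask)
    if not coords:
        raise ValueError("mask has no foreground pixels")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def centroid_point(mask):
    """Rounded centroid of the foreground, usable as a point prompt.

    For ring-shaped regions such as the iris, the centroid can fall
    outside the region, so a different point would be needed there.
    """
    coords = foreground_pixels(mask)
    if not coords:
        raise ValueError("mask has no foreground pixels")
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    return (round(cx), round(cy))


def make_point_prompts(foreground, background):
    """Combine points into parallel coordinate and label lists.

    Labels: 1 marks a foreground point, 0 marks a background point.
    """
    coords = list(foreground) + list(background)
    labels = [1] * len(foreground) + [0] * len(background)
    return coords, labels


# Toy 5x6 mask with a 3x3 blob standing in for a pupil.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
box = box_from_mask(mask)
point = centroid_point(mask)
coords, labels = make_point_prompts([point], [(0, 0)])
```

In an actual run, these values would be converted to arrays and passed to the model along with the image; a background point placed on a neighboring structure is one way to discourage the mask from spilling over.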
