Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3)

Diederick C. Niehorster, Marcus Nyström

Abstract

Previous work has reported that vision foundation models show promising zero-shot performance in eye image segmentation. Here we examine whether the latest iteration of the Segment Anything Model, SAM3, offers better eye image segmentation performance than SAM2, and explore the performance of its new concept (text) prompting mode. Eye image segmentation performance was evaluated using diverse datasets encompassing both high-resolution high-quality videos from a lab environment and the TEyeD dataset consisting of challenging eye videos acquired in the wild. Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2, for both lab and in-the-wild datasets. Since SAM2 not only performed better but was also faster, we conclude that SAM2 remains the best option for eye image segmentation. We provide our adaptation of SAM3's codebase that allows processing videos of arbitrary duration.

Paper Structure (12 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Example prompts and resulting segmentations for the high-resolution lab datasets (left) and the TEyeD datasets (right) for SAM 2 visual prompting (top), SAM 3 visual prompting (middle) and SAM 3 concept prompting (bottom). For the lab images, the prompted frame, the 100th frame and the 1000th frame in the eye video are shown. For the TEyeD images, the prompted frame, the 1000th, the 10000th and the 20000th frame are shown. For the lab datasets (left): the dark blue mask indicates the CR, green the pupil, red the iris and cyan the sclera. For the TEyeD datasets (right): dark blue is the pupil, green the iris, red the sclera. Positive (+) and negative (square) prompts use the same color codes. For the bottom row, different colors instead indicate different "pupil" objects returned by the model. The brown in some segmentations results from the iris and sclera masks overlapping.
  • Figure 2: Prompting TEyeD. The left panel shows the ground truth annotations provided by TEyeD for the pupil (green ellipse), iris (red ellipse) and palpebral fissure (yellow polygon). Also indicated are the determined eye corners (cyan dots), closest points on the iris ellipse (orange points) and the derived prompt coordinates for the pupil (green +), the iris (red +) and the sclera (blue and magenta +). The right panel shows the corresponding positive (+) and negative (square) visual prompts provided to both SAM models (blue: pupil, green: iris, red: sclera). Also shown are the output masks from SAM 3, using the same color codes as the prompts.
  • Figure 3: RMS-S2S precision and data loss for SAM 2 and SAM 3 on the high-resolution lab datasets. Shown are the RMS-S2S precision (lower is better) and data loss (lower is better) per participant. Note that the range of the y-axis differs between panels. A summary showing the precision or data loss on the same scale is shown in the rightmost bar graphs (error bars indicate SEM across participants). Stars indicate significant differences according to paired t-tests.
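
The two measures named in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the common eye-tracking definitions: RMS-S2S as the root mean square of distances between consecutive valid samples, and data loss as the fraction of samples without a valid signal. The function names, the use of NaN to mark missing samples, and computing over the whole signal rather than in windows are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def rms_s2s(x, y):
    """RMS of sample-to-sample distances between consecutive valid samples.

    x, y: equal-length sequences of signal coordinates; NaN marks a
    missing sample (an assumption of this sketch).
    """
    squared = []
    for i in range(1, len(x)):
        pair = (x[i - 1], y[i - 1], x[i], y[i])
        if any(math.isnan(v) for v in pair):
            continue  # skip any step that touches a missing sample
        squared.append((x[i] - x[i - 1]) ** 2 + (y[i] - y[i - 1]) ** 2)
    if not squared:
        return float("nan")
    return math.sqrt(sum(squared) / len(squared))


def data_loss(x, y):
    """Fraction of samples in which either coordinate is missing."""
    if len(x) == 0:
        return float("nan")
    missing = sum(1 for a, b in zip(x, y) if math.isnan(a) or math.isnan(b))
    return missing / len(x)


# Toy signal with one missing sample out of five.
nan = float("nan")
xs = [0.0, 3.0, 3.0, nan, 0.0]
ys = [0.0, 4.0, 4.0, nan, 0.0]
print(rms_s2s(xs, ys), data_loss(xs, ys))
```

In this toy signal only two consecutive pairs are fully valid (squared distances 25 and 0), so the RMS-S2S value is the square root of 12.5, and one of the five samples is missing, giving a data loss of 0.2.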