Benchmarking Human and Automated Prompting in the Segment Anything Model

Jorge Quesada, Zoe Fowler, Mohammad Alotaibi, Mohit Prabhushankar, Ghassan AlRegib

TL;DR

This work leverages PointPrompt, a recently released visual prompting dataset, and introduces a number of benchmarking tasks that provide opportunities to better understand how human prompts differ from automated ones and what underlying factors make for effective visual prompts.

Abstract

The remarkable capabilities of the Segment Anything Model (SAM) for tackling image segmentation tasks in an intuitive and interactive manner have sparked interest in the design of effective visual prompts. Such interest has led to the creation of automated point prompt selection strategies, typically motivated from a feature extraction perspective. However, there is still very little understanding of how appropriate these automated visual prompting strategies are, particularly when compared to humans, across diverse image domains. Additionally, the performance benefits of including such automated visual prompting strategies within the finetuning process of SAM remain unexplored, as does the effect of interpretable factors, such as the distance between prompt points, on segmentation performance. To bridge these gaps, we leverage a recently released visual prompting dataset, PointPrompt, and introduce a number of benchmarking tasks that provide opportunities to better understand how human prompts differ from automated ones and what underlying factors make for effective visual prompts. We demonstrate that the segmentation scores obtained by humans are approximately 29% higher than those given by automated strategies, and we identify potential features that are indicative of prompting performance with $R^2$ scores over 0.5. Additionally, we demonstrate that performance when using automated methods can be improved by up to 68% via a finetuning approach. Overall, our experiments not only showcase the existing gap between human prompts and automated methods, but also highlight potential avenues through which this gap can be leveraged to improve effective visual prompt design. Further details, along with the dataset links and code, are available at https://github.com/olivesgatech/PointPrompt

Paper Structure

This paper contains 20 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Our paper aims to (a) compare segmentation performance of automated point prompt generation strategies to human performance, (b) understand whether differences in human and automated performance can be mitigated through a finetuning-based approach, and (c) identify whether features derived from human-generated prompts can be indicative overall of segmentation performance to provide insights into characteristics of effective prompts.
  • Figure 2: Mask outputs when using automated strategies versus a human annotator's output mask
  • Figure 3: Framework for determining optimal number of inclusion and exclusion points, where points are originally sampled based on the average number of inclusion/exclusion points used by the human annotators for a specific image category
  • Figure 4: (b): $R^2$ prediction scores across individual datasets. Top: prediction score using linear regression. Bottom: prediction score using a 2nd-degree polynomial regression. From left to right: prediction using only data features, only prompt features, or both sets of features
  • Figure 5: Visual example of two different strategies in seismic (salt dome) prompting. Top: example strategy with high exclusion margin and high inclusion max spread. Middle: example strategy with low exclusion margin and low inclusion max spread. Bottom: Segmentation ground truth mask.
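
The $R^2$ scores mentioned in the abstract and in Figure 4 compare a linear regression against a 2nd-degree polynomial regression. The sketch below illustrates only that comparison; it is not the authors' pipeline. It assumes a single hypothetical prompt feature (called `spread` here) and made-up segmentation scores, and it fits both models with a plain least-squares solver written for this example.

```python
# Illustrative sketch only: the feature name, the data, and the solver are
# assumptions made for this example, not taken from the paper.

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y with V[i][j] = xs[i] ** j.
    a = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (b[r] - tail) / a[r][r]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def r2_score(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up data: a hypothetical prompt feature and a score per prompt set.
spread = [0.0, 1.0, 2.0, 3.0, 4.0]
score = [0.20, 0.45, 0.60, 0.65, 0.60]

linear = fit_polynomial(spread, score, degree=1)
quadratic = fit_polynomial(spread, score, degree=2)

r2_linear = r2_score(score, [predict(linear, x) for x in spread])
r2_quadratic = r2_score(score, [predict(quadratic, x) for x in spread])

print(f"linear R^2:    {r2_linear:.3f}")
print(f"quadratic R^2: {r2_quadratic:.3f}")
```

Because the made-up scores rise and then flatten, the 2nd-degree fit explains more of the variance than the linear one. On training data a higher-degree polynomial never scores a lower $R^2$ than a lower-degree one, so a fair comparison on real prompt features would need held-out data.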