Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects

Yixin Zhang, Nicholas Konz, Kevin Kramer, Maciej A. Mazurowski

TL;DR

This paper identifies fundamental failure modes of segmentation foundation models (SFMs) when handling tree-like and low-contrast objects. It introduces two interpretable metrics, Contour Pixel Rate and Difference of Gini Impurity Deviation, to quantify object tree-likeness, and a textural separability metric based on early neural features to quantify texture differences with the background. Across carefully controlled synthetic and real datasets, the study shows strong, consistent correlations between these object characteristics and SFM IoU, and finds that fine-tuning the models does not eliminate the problems. The findings reveal that SFMs tend to over-segment or misclassify tree-like and low-texture objects due to how patch-based attention interprets local structure as texture, with important implications for model design and evaluation in applications requiring robust segmentation of complex shapes and textures.

Abstract

Image segmentation foundation models (SFMs) like Segment Anything Model (SAM) have achieved impressive zero-shot and interactive segmentation across diverse domains. However, they struggle to segment objects with certain structures, particularly those with dense, tree-like morphology and low textural contrast from their surroundings. These failure modes are crucial for understanding the limitations of SFMs in real-world applications. To systematically study this issue, we introduce interpretable metrics quantifying object tree-likeness and textural separability. On carefully controlled synthetic experiments and real-world datasets, we show that SFM performance (e.g., SAM, SAM 2, HQ-SAM) noticeably correlates with these factors. We attribute these failures to SFMs misinterpreting local structure as global texture, resulting in over-segmentation or difficulty distinguishing objects from similar backgrounds. Notably, targeted fine-tuning fails to resolve this issue, indicating a fundamental limitation. Our study provides the first quantitative framework for modeling the behavior of SFMs on challenging structures, offering interpretable insights into their segmentation capabilities.
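
As a rough illustration of the kind of analysis the abstract describes, the sketch below computes IoU between a predicted and a ground-truth mask and then a rank correlation between per-object IoU and a per-object score. It is a minimal sketch under stated assumptions, not the authors' code: Spearman correlation is assumed here (the source only says "rank correlations"), the helper names `iou`, `ranks`, and `spearman` are hypothetical, and the numbers in the usage lines are toy values, not results from the paper.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy usage: hypothetical per-object scores and IoUs (not from the paper).
toy_scores = [0.1, 0.3, 0.5, 0.9]
toy_ious = [0.95, 0.80, 0.60, 0.20]
rho = spearman(toy_scores, toy_ious)
```

A strongly negative `rho` on such data would mean that IoU falls as the object score rises, which is the shape of relationship the abstract reports between SFM performance and the proposed object characteristics.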

Paper Structure

This paper contains 45 sections, 6 equations, 23 figures, 10 tables, 3 algorithms.

Figures (23)

  • Figure 1: SAM's segmentation performance tends to drop noticeably when the object has high tree-likeness (left, on DIS) or low textural separability (right, on iShape)--even with heavy prompting--which we investigate in this work.
  • Figure 2: Example retinal blood vessel (top) and satellite road (bottom) images and accompanying object segmentation masks.
  • Figure 3: Left: Example synthetic tree-like images and object masks. Right: Trend of increasing tree-likeness (increasing CPR/decreasing DoGD) of these objects (left in each pair) resulting in worse SAM segmentation predictions (right in each pair).
  • Figure 4: Left: SFM prediction IoU vs. object tree-likeness (CPR and DoGD), on the synthetic dataset, shown for SAM-H. Right: Rank correlations between IoU and tree-likeness for all SFMs.
  • Figure 5: Left: Segmentation IoU vs. object tree-likeness (CPR, top; DoGD, bottom) on DIS (left) for all three SFMs and iShape (right) shown for SAM-H. Right: rank correlations between IoU and tree-likeness for all SFMs.
  • ...and 18 more figures

Theorems & Definitions (1)

  • Definition 4.1: Contour Pixels