Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images

Ahjol Senbi, Tianyu Huang, Fei Lyu, Qing Li, Yuhui Tao, Wei Shao, Qiang Chen, Chengyan Wang, Shuo Wang, Tao Zhou, Yizhe Zhang

TL;DR

This work tackles the challenge of evaluating medical image segmentations without ground-truth masks by introducing EvanySeg, a ground-truth-free evaluator that predicts segmentation quality for SAM and its variants. The authors justify the approach theoretically, construct training data using multiple SAM variants, and implement a regression-based architecture (preprocessing + ViT/ResNet backbones) to output quality scores. Extensive experiments across diverse ultrasound, CT, and in-house datasets show strong correlations between predicted and true Dice scores, with ViT backbones generally outperforming CNNs. EvanySeg is demonstrated to identify poor segmentations, benchmark models without ground truth, and enable sample-wise model selection, facilitating more reliable and efficient human-AI collaboration in medical imaging.
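The agreement between predicted and true Dice scores mentioned above can be illustrated with a minimal sketch. It assumes binary masks given as flat lists of 0/1 values and uses Pearson correlation as one possible agreement measure; the function names `dice_score` and `pearson` are hypothetical and are not taken from the EvanySeg code base.

```python
from math import sqrt


def dice_score(pred, truth):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks (flat lists of 0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention (an assumption here): two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


def pearson(xs, ys):
    """Pearson correlation between predicted and true quality scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / sqrt(var_x * var_y)
```

In an evaluation of this kind, `dice_score` would be computed against held-out ground truth, and `pearson` would compare those values with the evaluator's predicted scores for the same samples.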

Abstract

We explore the feasibility and potential of building a ground-truth-free evaluation model to assess the quality of segmentations generated by the Segment Anything Model (SAM) and its variants in medical imaging. This evaluation model estimates segmentation quality scores by analyzing the coherence and consistency between the input images and their corresponding segmentation predictions. Based on prior research, we frame the task of training this model as a regression problem within a supervised learning framework, using Dice scores (and optionally other metrics) along with mean squared error to compute the training loss. The model is trained utilizing a large collection of public datasets of medical images with segmentation predictions from SAM and its variants. We name this model EvanySeg (Evaluation of Any Segmentation in Medical Images). Our exploration of convolution-based models (e.g., ResNet) and transformer-based models (e.g., ViT) suggested that ViT yields better performance for this task. EvanySeg can be employed for various tasks, including: (1) identifying poorly segmented samples by detecting low-percentile segmentation quality scores; (2) benchmarking segmentation models without ground truth by averaging quality scores across test samples; (3) alerting human experts to poor-quality segmentation predictions during human-AI collaboration by applying a threshold within the score space; and (4) selecting the best segmentation prediction for each test sample at test time when multiple segmentation models are available, by choosing the prediction with the highest quality score. Models and code will be made available at https://github.com/ahjolsenbics/EvanySeg.
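The four uses listed in the abstract can be sketched in a few lines once quality scores are available. This is an illustrative sketch, not the authors' implementation: the function names, the nearest-rank percentile rule, the default threshold of 0.7, and the model labels in the usage note are all assumptions made for the example.

```python
from math import ceil


def flag_low_percentile(scores, percentile=10.0):
    """Use (1): indices of samples at or below a low percentile of the scores."""
    if not scores:
        return []
    ranked = sorted(scores)
    # Nearest-rank percentile: take the k-th smallest score as the cutoff.
    k = max(1, ceil(percentile * len(ranked) / 100))
    cutoff = ranked[k - 1]
    return [i for i, s in enumerate(scores) if s <= cutoff]


def benchmark(scores):
    """Use (2): average predicted quality over a test set, no ground truth needed."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores) / len(scores)


def needs_review(score, threshold=0.7):
    """Use (3): alert a human expert when the score falls below a threshold.

    The value 0.7 is a placeholder, not a threshold reported in the paper.
    """
    return score < threshold


def select_best(scores_by_model):
    """Use (4): for one sample, pick the model whose prediction scores highest."""
    if not scores_by_model:
        raise ValueError("no candidate predictions")
    return max(scores_by_model, key=scores_by_model.get)
```

For example, `select_best({"variant_a": 0.71, "variant_b": 0.84})` picks the label with the highest predicted score for that sample, where the labels stand in for whichever segmentation models are available.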

Paper Structure

This paper contains 19 sections, 4 theorems, 4 equations, 15 figures, 3 tables.

Key Result

Theorem 3.1

Given an image, estimating the segmentation quality (e.g., Dice score) of a segmentation map (Problem A) for this image is no harder than generating a perfectly accurate segmentation map (Problem B) for this image.
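One way to read this statement is as a reduction from Problem A to Problem B. The sketch below is an informal illustration, not the paper's proof: the symbols $S$ (the segmentation map being evaluated) and $S^{*}$ (a perfectly accurate segmentation map of the same image) are notation assumed here. If a solver for Problem B returns $S^{*}$, the Dice score of $S$ follows from a single pass over the pixels:

```latex
\[
\mathrm{Dice}(S, S^{*}) \;=\; \frac{2\,\lvert S \cap S^{*} \rvert}{\lvert S \rvert + \lvert S^{*} \rvert}
\]
```

Under this reading, any method that solves Problem B also solves Problem A at the extra cost of one overlap computation, which is the sense in which A is no harder than B.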

Figures (15)

  • Figure 1: EvanySeg is a companion model to SAM and its variants, designed to enhance reliability and trustworthiness in the deployment of SAM (and its variants) on medical images.
  • Figure 2: An illustration of the training data in a scenario where a bounding box was used as the prompt. The regions of interest are defined by the bounding box prompts.
  • Figure 3: Samples from the MRI-Kidney100 dataset.
  • Figure 4: Samples from the CT-BCT100 dataset.
  • Figure 5: Samples from the Endo-Polyp1000 dataset.
  • ...and 10 more figures

Theorems & Definitions (7)

  • Theorem 3.1
  • Theorem 3.2
  • Definition 3.1
  • Definition 3.2
  • Proposition 3.3
  • Definition 3.3
  • Proposition 3.4