Performance Estimation for Supervised Medical Image Segmentation Models on Unlabeled Data Using UniverSeg

Jingchen Zou; Jianqiang Li; Gabriel Jimenez; Qing Zhao; Daniel Racoceanu; Matias Cosarinsky; Enzo Ferrante; Guanghui Fu

TL;DR

This work addresses the problem of estimating segmentation model performance on unlabeled clinical data without requiring additional annotations. It introduces the Segmentation Performance Evaluator (SPE), a four-stage framework that saves model checkpoints at multiple training epochs, uses UniverSeg to compute a reverse (pseudo) performance on a labeled training set, and learns a mapping $G$ that predicts real performance from pseudo-performance. Across six datasets and six metrics, SPE achieves high correlation (around 0.92–0.99) and low mean absolute error (as low as $0.025$ for Dice), indicating reliable annotation-free estimation with no extra training overhead. The method is model- and metric-agnostic, works with standard training pipelines, and has practical potential for deploying medical image segmentation in real-world, data-scarce environments.
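To make the role of the mapping $G$ concrete, the sketch below fits a one-dimensional least-squares line from pseudo-performance to real performance over a set of checkpoints and then applies it to a new pseudo score. The linear form of $G$, the function names `fit_linear_map` and `apply_map`, and all the numbers are assumptions made for illustration; this summary does not specify how $G$ is parameterized or how the paired scores are collected.

```python
from statistics import mean


def fit_linear_map(pseudo, real):
    """Least-squares fit of real ~ a * pseudo + b (an assumed form for G)."""
    if len(pseudo) != len(real) or len(pseudo) < 2:
        raise ValueError("need at least two paired scores")
    mp, mr = mean(pseudo), mean(real)
    var = sum((p - mp) ** 2 for p in pseudo)
    if var == 0:
        raise ValueError("pseudo scores must not all be identical")
    a = sum((p - mp) * (r - mr) for p, r in zip(pseudo, real)) / var
    b = mr - a * mp
    return a, b


def apply_map(params, pseudo_score):
    """Estimate real performance from one pseudo-performance value."""
    a, b = params
    return a * pseudo_score + b


# Hypothetical per-checkpoint scores, invented for illustration only.
pseudo_dice = [0.40, 0.55, 0.70, 0.80]
real_dice = [0.50, 0.62, 0.74, 0.82]

G = fit_linear_map(pseudo_dice, real_dice)
estimate = apply_map(G, 0.75)  # estimated real Dice for an unlabeled set
```

Using several checkpoints gives the fit a spread of pseudo scores to work with; a single checkpoint would not determine a line.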

Abstract

The performance of medical image segmentation models is usually evaluated using metrics such as the Dice score and Hausdorff distance, which compare predicted masks to ground-truth annotations. However, when a model is applied to unseen data, such as in clinical settings, it is often impractical to annotate all the data, which leaves the model's performance uncertain. To address this challenge, we propose the Segmentation Performance Evaluator (SPE), a framework for estimating segmentation models' performance on unlabeled data. This framework is adaptable to various evaluation metrics and model architectures. Experiments on six publicly available datasets across six evaluation metrics, including pixel-based metrics such as the Dice score and distance-based metrics such as HD95, demonstrated the versatility and effectiveness of our approach, achieving a high correlation (0.956$\pm$0.046) and low MAE (0.025$\pm$0.019) compared with the real Dice score on the independent test set. These results highlight its ability to reliably estimate model performance without requiring annotations. The SPE framework integrates into any model training process without adding training overhead, enabling performance estimation and facilitating the real-world application of medical image segmentation algorithms. The source code is publicly available.
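For readers less familiar with the quantities reported above, the following sketch shows how a Dice score, a mean absolute error (MAE), and a Pearson correlation can be computed. It uses flat lists of 0/1 values as masks and plain Python. The helper names, the convention that two empty masks score 1.0, and the sample numbers are assumptions for illustration, not the authors' evaluation code.

```python
from math import sqrt


def dice_score(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks (lists of 0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_absolute_error(estimated, real):
    """Average absolute gap between estimated and real scores."""
    return sum(abs(e - r) for e, r in zip(estimated, real)) / len(estimated)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy values, invented for illustration only.
example_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
example_mae = mean_absolute_error([0.80, 0.70], [0.82, 0.66])
example_corr = pearson([0.1, 0.2, 0.3], [1.2, 1.4, 1.6])
```

In this setting, MAE measures how far the estimated scores are from the real ones, while the correlation measures whether the estimates rank the cases in the same order as the real scores.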

