Good Enough: Is it Worth Improving your Label Quality?

Alexander Jaus, Zdravko Marinov, Constantin Seibold, Simon Reiß, Jens Kleesiek, Rainer Stiefelhagen

TL;DR

This work investigates whether investing in higher-quality labels for medical CT segmentation is worthwhile. By generating seven pseudo-label datasets from diverse predictors (including nnU-Net, TotalSegmentator, MedSAM, and STU-Net variants) across five base CT datasets, the authors create an independent benchmark to study label quality effects on in-domain performance and pre-training transfer. They find that in-domain gains track label quality and can be substantial, but small improvements yield dataset-dependent or negligible benefits, while pre-training benefits are largely insensitive to label quality. The study concludes that label refinement should be prioritized for in-domain segmentation tasks where substantial improvements are achievable, whereas its value for pre-training transfer is limited.

Abstract

Improving label quality in medical image segmentation is costly, but its benefits remain unclear. We systematically evaluate its impact using multiple pseudo-labeled versions of CT datasets, generated by models like nnU-Net, TotalSegmentator, and MedSAM. Our results show that while higher-quality labels improve in-domain performance, the gains remain unclear when the improvement in label quality falls below a small threshold. For pre-training, label quality has minimal impact, suggesting that models transfer general concepts rather than detailed annotations. These findings provide guidance on when improving label quality is worth the effort.

Paper Structure

This paper contains 8 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of model predictions (yellow) and the ground truth (green). STU-Net predictions tend to improve with model size, while the types of errors the models make stay the same, e.g. over-segmentation of Couinaud's liver segment VI (red circle) and iterative improvement in segment IV (red arrows). Best seen on screen with zoom.
  • Figure 2: In-Domain evaluation results for Dice ("$\times$") and Surface Dice ("$\Updelta$"). We include dashed $y=x$ for reference and zoom into areas if markers are cluttered.
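
For readers unfamiliar with the Dice metric reported in Figure 2, the following is a minimal sketch of how Dice is commonly computed for a pair of binary masks. It is not the authors' evaluation code: the helper name `dice_score`, the use of NumPy, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration. Surface Dice, which compares boundaries within a distance tolerance, is not covered here.

```python
import numpy as np


def dice_score(pred, target):
    """Dice = 2 * |A & B| / (|A| + |B|) for two binary masks.

    `dice_score` is a hypothetical helper, not taken from the paper.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0

    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / denom)


# Toy example: prediction marks two voxels, ground truth marks one of them.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice_score(pred, target)  # 2 * 1 / (2 + 1)
```

In a multi-class setting, a score like this would typically be computed per class and then averaged, although the exact aggregation used for the figure is not stated in this section.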