Quality Sentinel: Estimating Label Quality and Errors in Medical Segmentation Datasets

Yixiong Chen; Zongwei Zhou; Alan Yuille

TL;DR

Quality Sentinel addresses the lack of automatic, cross-class label quality control (QC) in medical segmentation by introducing a CLIP-conditioned regression model that predicts label quality, expressed as DSC, across 142 organs. It combines a 2D slice-based DSC estimator with a scale-insensitive optimal pair ranking loss, enabling both accurate quality estimation and effective sample selection for active learning and semi-supervised training. The approach reveals dataset-wide quality variability and biases (e.g., by gender and age), and demonstrates significant data-efficient improvements over entropy- and diversity-based baselines, including strong cross-domain performance on external datasets. By releasing data and models, this work proposes a practical pathway to robust, scalable QC for large-scale medical segmentation datasets and highlights avenues for extending it to more modalities and broader clinical contexts.

Abstract

An increasing number of public datasets have shown a transformative impact on automated medical segmentation. However, these datasets often vary in label quality, ranging from manual expert annotations to AI-generated pseudo-annotations. There is no systematic, reliable, and automatic quality control (QC). To fill this gap, we introduce a regression model, Quality Sentinel, to estimate label quality compared with manual annotations in medical segmentation datasets. This regression model was trained on over 4 million image-label pairs created by us. Each pair presents a varying but quantified label quality based on manual annotations, which enables us to predict the label quality of any image-label pair at inference. Quality Sentinel can predict the label quality of 142 body structures. The predicted label quality, quantified by the Dice Similarity Coefficient (DSC), shows a strong positive correlation with the ground-truth quality (r=0.902). Quality Sentinel has found multiple impactful use cases. (I) We evaluated label quality in publicly available datasets, where quality varies widely across datasets. Our analysis also uncovers that male and younger subjects exhibit significantly higher label quality. (II) We identified and corrected poorly annotated labels, achieving a 1/3 reduction in annotation costs with optimal budgeting on TotalSegmentator. (III) We enhanced AI training efficiency and performance by focusing on high-quality pseudo labels, resulting in a 33%-88% performance boost over entropy-based methods, at a cost of 31% time and 4.5% memory. The data and model are released.
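
The abstract measures label quality with the Dice Similarity Coefficient and reports agreement between predicted and ground-truth quality as a correlation coefficient. The sketch below illustrates both quantities on toy data using their standard definitions. The function names `dice` and `pearson_r`, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def dice(pred, gt):
    """Dice Similarity Coefficient between two binary masks: 2|A∩B| / (|A|+|B|)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    total = pred.sum() + gt.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, gt).sum() / total)

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1D sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Toy example: a 4x4 ground-truth square and a label shifted down by one row.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True          # 16 pixels
label = np.zeros((8, 8), dtype=bool)
label[3:7, 2:6] = True       # 16 pixels, 12 of them overlap with gt

print(dice(label, gt))       # 2 * 12 / (16 + 16) = 0.75
```

A quality estimator of the kind described here would be judged by calling something like `pearson_r(predicted_dsc, true_dsc)` over a held-out set of image-label pairs.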

Paper Structure (16 sections, 3 equations, 6 figures, 3 tables)

Introduction
Related Work
Multi-Organ Segmentation
Medical Segmentation Quality Control
Language-driven Medical Image Analysis
Quality Sentinel
Formulation of the Quality Estimation Framework
Text-driven Condition Module
Optimal Pair Ranking Loss
Experiments & Results
Experimental Settings
Evaluation of Quality Sentinel
Dataset Evaluation with Quality Sentinel
Data Efficient Training
Discussion and Limitations
...and 1 more section

Figures (6)

Figure 1: Overview of Quality Sentinel: Inputs are image-label pairs with pseudo labels. It employs a vision encoder to estimate DSC relative to ground truths. The Text-driven Condition Module encodes texts to augment the model for organ identification, addressing the ambiguity of organ-label matching (e.g., for $(I_k, S_k)$ at upper left, the mask covers two organs). Training involves a compositional loss, combining optimal pair ranking and MSE, to align predicted with actual DSC.
Figure 2: Illustration of (a) the curves of model performance and data amount and (b) the curves of performance deviation w.r.t. slice amount compared to prediction with all slices.
Figure 3: Evaluation of model performance. Scatter plots of predicted DSC against ground-truth DSC on (a) the Quality Sentinel test set and, as external evaluation on BTCV, with (b) erosion mask degeneration and (c) dilation degeneration.
Figure 4: Illustration of labels with lower (left) and higher (right) predicted DSC in AbdomenAtlas dataset. Typical label errors detected include reduced masks (the first two of the liver), missing masks (the first one of the pancreas), and expanded masks (the first three of the gallbladder).
Figure 5: The annotation bias w.r.t. gender (left) and age (right) for TotalSegmentator. We show several representative organs with higher label quality for males than females and find a significant overall quality difference for different genders. In addition, older patients generally have worse label quality than younger patients.
...and 1 more figure
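
The Figure 3 caption mentions erosion and dilation as ways of degrading masks for external evaluation. As a rough illustration of why such degradations yield labels with a known DSC, the sketch below erodes and dilates a toy 2D mask with a 4-neighbour (cross-shaped) structuring element and measures the DSC against the original. The helper names, the structuring element, and the toy mask are all assumptions made here; this is not the authors' data-generation pipeline.

```python
import numpy as np

def dice(a, b):
    """Dice Similarity Coefficient between two non-empty binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return float(2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum()))

def _neighbourhood(mask):
    """Return the mask and its four one-pixel shifts (background outside the image)."""
    p = np.pad(mask, 1, constant_values=False)
    return [p[1:-1, 1:-1], p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:]]

def erode(mask, steps=1):
    """Shrink a 2D binary mask: keep a pixel only if it and its 4 neighbours are set."""
    mask = np.asarray(mask, dtype=bool)
    for _ in range(steps):
        mask = np.logical_and.reduce(_neighbourhood(mask))
    return mask

def dilate(mask, steps=1):
    """Grow a 2D binary mask: set a pixel if it or any of its 4 neighbours is set."""
    mask = np.asarray(mask, dtype=bool)
    for _ in range(steps):
        mask = np.logical_or.reduce(_neighbourhood(mask))
    return mask

# Toy ground truth: a 6x6 square (36 pixels) inside a 10x10 image.
gt = np.zeros((10, 10), dtype=bool)
gt[2:8, 2:8] = True

eroded = erode(gt)     # 4x4 core remains: 16 pixels
dilated = dilate(gt)   # square plus its four sides: 60 pixels

print(dice(eroded, gt))   # 2 * 16 / (16 + 36) ≈ 0.615
print(dice(dilated, gt))  # 2 * 36 / (60 + 36) = 0.75
```

Increasing `steps` lowers the DSC further, so a range of quality levels can be produced from a single reference mask.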
