USE-Evaluator: Performance Metrics for Medical Image Segmentation Models with Uncertain, Small or Empty Reference Annotations

Sophie Ostmeier; Brian Axelrod; Jeroen Bertels; Fabian Isensee; Maarten G. Lansberg; Soren Christensen; Gregory W. Albers; Li-Jia Li; Jeremy J. Heit

TL;DR

This work studies how uncertain, small, and empty reference annotations influence metric values on an in-house stroke data set, independently of the model, and compares the results with the public BraTS 2019 and Spinal Cord data sets.

Abstract

Performance metrics for medical image segmentation models measure the agreement between the reference annotation and the predicted segmentation. Overlap metrics such as the Dice are usually chosen to evaluate these models so that results are comparable. However, the distribution of cases and the difficulty of segmentation tasks in public data sets differ from those in clinical practice. Common metrics fail to measure the impact of this mismatch, especially for clinical data sets that include low-signal pathologies, a difficult segmentation task, and uncertain, small, or empty reference annotations. This limitation may lead machine learning practitioners to design and optimize models ineffectively. Evaluating clinical value involves considering the uncertainty of reference annotations, independence from reference annotation volume, and the classification of empty reference annotations. We study how uncertain, small, and empty reference annotations influence the value of metrics for medical image segmentation on an in-house data set, regardless of the model. We examine metric behavior on the predictions of a standard deep learning framework in order to identify metrics with clinical value. We compare the results with a public benchmark data set (BraTS 2019) that has a high-signal pathology and certain, larger, and no empty reference annotations. Our results may show machine learning practitioners how uncertain, small, or empty reference annotations require a rethinking of evaluation and optimization procedures. The evaluation code has been released to encourage further analysis of this topic: https://github.com/SophieOstmeier/UncertainSmallEmpty.git
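To make the abstract's concern concrete, the following minimal sketch computes the Dice on flattened binary masks and compares a large, a small, and an empty reference annotation. It is an illustration only, not the released evaluation code: the function name `dice`, the flat list representation of a mask, and the convention of returning NaN when both masks are empty are assumptions made for this example.

```python
import math
from typing import Sequence


def dice(reference: Sequence[int], prediction: Sequence[int]) -> float:
    """Dice of two flattened binary masks (1 = foreground, 0 = background).

    Assumption for this sketch: when both masks are empty the Dice is
    undefined (0/0), so NaN is returned instead of an arbitrary value.
    """
    if len(reference) != len(prediction):
        raise ValueError("masks must have the same number of voxels")
    reference_size = sum(1 for r in reference if r)
    prediction_size = sum(1 for p in prediction if p)
    intersection = sum(1 for r, p in zip(reference, prediction) if r and p)
    denominator = reference_size + prediction_size
    if denominator == 0:
        return math.nan
    return 2.0 * intersection / denominator


# Large reference (36 voxels); the prediction misses 6 of them.
large_ref = [1] * 36 + [0] * 64
large_pred = [1] * 30 + [0] * 70

# Small reference (2 voxels); the prediction misses a single voxel.
small_ref = [1, 1] + [0] * 98
small_pred = [1, 0] + [0] * 98

# Empty reference; the prediction contains one false positive voxel.
empty_ref = [0] * 100
one_fp_pred = [1] + [0] * 99

print(dice(large_ref, large_pred))   # 60/66, about 0.91
print(dice(small_ref, small_pred))   # 2/3, about 0.67
print(dice(empty_ref, one_fp_pred))  # 0.0
print(dice(empty_ref, empty_ref))    # nan
```

In this toy setting, missing one voxel of the small reference lowers the Dice more than missing six voxels of the large one, and a single false positive voxel on an empty reference yields the minimum score.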

Paper Structure (35 sections, 9 equations, 7 figures, 7 tables)

Introduction
Uncertain Reference Annotations
Small Reference Annotations
Empty Reference Annotations
Clinical Value
Metrics
Fundamentals
Surface Dice at Tolerance
Uncertainty Score
Voxel-level Class Imbalances
Class Imbalances of Segmentation
Image-level Class Imbalances
Methods
Data Sets
Data Partition
...and 20 more sections

Figures (7)

Figure 1: Examples of true positive ($TP_i$), true negative ($TN_i$), false positive ($FP_i$), and false negative ($FN_i$) cases for the NCCT and BraTS data sets at a volume threshold of 1 ml.
Figure 2: Log-scale scatter plot and confusion matrix with a volume threshold of 1 ml separating $TP$ and $TN$ from $FP$ and $FN$. For the NCCT data set (violet points), almost all incorrectly classified cases are predicted too small, namely $FN$, whereas the opposite holds for the BraTS non-enhancing tumor data set. None of the BraTS whole tumor cases are incorrectly classified.
Figure 3: Dot plot with regression lines for the Dice over class imbalance $p$ for all segmentation models, where $p = \frac{1}{1+IR}$. The gray areas represent 95% confidence intervals. The dark red dots and line represent the random model with the expected Dice $E_D$ given in the paper's definition of the random model. The dashed line indicates the expected Dice $E_D$ for a balanced reference mask.
Figure 4: Correlation matrices of the Spearman coefficient for data sets and metrics. X indicates non-significant correlations with $p>0.05$. Overall correlation patterns among metrics (e.g. Dice and SDT) remain similar across the data sets. The correlation of the Dice with uncertainty, as well as with the reference volume, is reproduced in all data sets, albeit to varying degrees.
Figure 5: Data sampling and partition for 5-fold cross-validation
...and 2 more figures
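The image-level classification described in the captions of Figures 1 and 2 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names `classify_case` and `confusion_counts`, the use of `>=` at the threshold, and the example volumes are assumptions; only the 1 ml volume threshold and the four case labels come from the captions.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_case(reference_ml: float, predicted_ml: float,
                  threshold_ml: float = 1.0) -> str:
    """Label one case as TP, TN, FP, or FN from its lesion volumes in ml.

    Assumption: a volume at or above the threshold counts as positive.
    """
    reference_positive = reference_ml >= threshold_ml
    predicted_positive = predicted_ml >= threshold_ml
    if reference_positive and predicted_positive:
        return "TP"
    if not reference_positive and not predicted_positive:
        return "TN"
    if predicted_positive:
        return "FP"  # prediction too large for a (near-)empty reference
    return "FN"      # prediction too small for a non-empty reference


def confusion_counts(cases: Iterable[Tuple[float, float]],
                     threshold_ml: float = 1.0) -> Counter:
    """Count case labels over (reference_ml, predicted_ml) pairs."""
    return Counter(classify_case(ref, pred, threshold_ml) for ref, pred in cases)


# Hypothetical volumes in ml, chosen only to exercise all four labels.
example_cases = [
    (12.0, 9.5),  # both above threshold
    (0.0, 0.2),   # both below threshold
    (0.0, 3.1),   # empty reference, large prediction
    (2.4, 0.3),   # small reference, prediction below threshold
    (1.6, 0.0),   # small reference, empty prediction
]
counts = confusion_counts(example_cases)
print(dict(counts))
```

Under these assumptions, a prediction that falls below the threshold for a reference above it is counted as $FN$, which matches the "too small" cases that Figure 2 reports for the NCCT data set.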
