Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation

Maximilian Zenk; David Zimmerer; Fabian Isensee; Jeremias Traub; Tobias Norajitra; Paul F. Jäger; Klaus Maier-Hein

TL;DR

This paper presents a comprehensive benchmarking framework for evaluating failure detection methods in medical image segmentation. It highlights the importance of pixel confidence aggregation and finds that the pairwise Dice score between ensemble predictions performs best.

Abstract

Semantic segmentation is an essential component of medical image analysis research, with recent deep learning algorithms offering out-of-the-box applicability across diverse datasets. Despite these advancements, segmentation failures remain a significant concern for real-world clinical applications, necessitating reliable detection mechanisms. This paper introduces a comprehensive benchmarking framework aimed at evaluating failure detection methodologies within medical image segmentation. Through our analysis, we identify the strengths and limitations of current failure detection metrics and advocate for risk-coverage analysis as a holistic evaluation approach. Using a collective dataset comprising five public 3D medical image collections, we assess the efficacy of various failure detection strategies under realistic test-time distribution shifts. Our findings highlight the importance of pixel confidence aggregation, and we observe superior performance of the pairwise Dice score (Roy et al., 2019) between ensemble predictions, positioning it as a simple and robust baseline for failure detection in medical image segmentation. To promote ongoing research, we make the benchmarking framework available to the community.
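To make the pairwise Dice baseline mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary masks (one foreground class), an ensemble given as a list of NumPy arrays, and that two empty masks count as perfect agreement; the function names `dice` and `pairwise_dice_confidence` are hypothetical.

```python
import itertools

import numpy as np


def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom


def pairwise_dice_confidence(masks):
    """Image-level confidence: mean Dice over all unordered pairs of
    ensemble member predictions. High agreement gives high confidence."""
    pairs = list(itertools.combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two ensemble predictions")
    return float(np.mean([dice(a, b) for a, b in pairs]))


# Toy example: three ensemble members, two of which agree exactly.
m1 = np.array([1, 1, 0, 0])
m2 = np.array([1, 1, 0, 0])
m3 = np.array([1, 0, 0, 0])
confidence = pairwise_dice_confidence([m1, m2, m3])
```

A low score indicates that the ensemble members disagree on the segmentation, which can be used to flag the image as a likely failure. The score needs at least two predictions, so it cannot be computed from a single network output.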

Paper Structure (34 sections, 3 equations, 19 figures, 5 tables)


Introduction
Realistic Evaluation of Failure Detection Methods
Task Definition
Requirements on the Evaluation Protocol
Related Work
Image-level Failure Detection Methods
Segmentation Quality Estimation
Distribution Shift Detection
Pixel-level Uncertainty Methods
Aggregation of Pixel-level Uncertainties
Benchmarking Efforts for Segmentation Failure Detection
Materials and Methods
Evaluation
Datasets
Segmentation algorithm
...and 19 more sections

Figures (19)

Figure 1: Overview of the research questions and contributions of this paper. Based on a formal definition of the image-level failure detection task, we formulate requirements for the evaluation protocol. Existing failure detection metrics are compared and the risk-coverage analysis is identified as a suitable evaluation protocol. We then propose a benchmarking framework for failure detection in medical image segmentation, which includes a diverse pool of 3D medical image datasets. A wide range of relevant methods are compared, including lines of research for image-level confidence and aggregated pixel confidence, which have mostly been studied separately so far.
Figure 2: Segmentation performance of a single U-Net on the test sets. Boxes show the median and IQR, while whiskers extend to the 5th and 95th percentiles, respectively. Each dataset contains samples drawn from the same distribution as the training set (in-distribution, ID) and samples drawn from a different data distribution (dataset shift) with the same structures to be segmented. Usually, the performance on the in-distribution samples is higher than on the samples with distribution shift, but especially for the Kidney tumor (which lacks dataset shifts) and Covid datasets, there are also several in-distribution failure cases.
Figure 3: Comparison of aggregation methods in terms of AURC scores for all datasets (lower is better). The experiments are named as "prediction model + confidence method" and each of them was repeated using 5 folds. Colored markers denote AURC values achieved by the methods, while gray marks above/below them are AURC values for random/optimal confidence rankings (which differ between the models trained on different folds; see the Evaluation section). Pairwise DSC consistently scores best, but it cannot be applied to single network outputs. Aggregation methods based on regression forests (RF) also show performance gains compared to the mean PE baseline, but fail catastrophically on the prostate dataset, possibly due to the small training set size. PE: predictive entropy. RF: regression forest.
Figure 4: Rankings by average AURC (top, lower ranks are better) and the underlying AURC scores (bottom; lower is better) for all datasets and methods. The experiments are named as "prediction model + confidence method" and each of them was repeated using 5 folds. In the lower diagram, colored dots denote AURC values achieved by the methods, while gray marks above/below them are AURC values for random/optimal confidence rankings (which differ between the models trained on different folds; see the Evaluation section). Most of the aggregation methods from Figure 3 were excluded for clarity, as they perform worse than pairwise DSC. Ensemble + pairwise DSC is the best method overall, often achieving close to optimal AURC scores. The ranking on the prostate dataset is an outlier, which could be due to the small training set size. PE: predictive entropy.
Figure 5: Impact of the choice of segmentation metric as a risk function on the ranking stability, comparing mean DSC (left) and NSD (right). Bootstrapping ($N=500$) was used to obtain a distribution of ranks for the results of each fold and the ranking distributions of all folds were accumulated. All ranks across datasets are combined in this figure, where the circle area is proportional to the rank count and the black x-markers indicate median ranks, which were also used to sort the methods. Overall, the ranking distributions are similar for mean DSC and NSD. The variance in the ranking distributions largely originates from combining the rankings across datasets, so for each dataset individually the ranking is more stable (see, for example, the Covid dataset in \ref{fig:compare_surface_dice_520}).
...and 14 more figures
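
The AURC values reported in the figure captions summarize a risk-coverage curve. The sketch below shows one common empirical definition, which may differ in detail from the paper's exact protocol. It assumes that each test image has a risk (for example 1 - DSC) and a confidence score, and that AURC is the mean selective risk over all coverage levels; the function name `aurc` is hypothetical.

```python
import numpy as np


def aurc(risks, confidences):
    """Area under the risk-coverage curve (lower is better).

    Cases are sorted by decreasing confidence. At coverage k/n, the
    selective risk is the mean risk of the k most confident cases.
    AURC is the average of these selective risks over k = 1..n.
    """
    risks = np.asarray(risks, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    if risks.shape != confidences.shape or risks.size == 0:
        raise ValueError("risks and confidences must be non-empty and aligned")
    order = np.argsort(-confidences, kind="stable")
    sorted_risks = risks[order]
    selective_risks = np.cumsum(sorted_risks) / np.arange(1, risks.size + 1)
    return float(selective_risks.mean())


# Toy example with risk = 1 - DSC for three images.
risks = np.array([0.1, 0.5, 0.9])
good_conf = np.array([0.9, 0.5, 0.1])   # confident where risk is low
bad_conf = np.array([0.1, 0.5, 0.9])    # confident where risk is high

aurc_good = aurc(risks, good_conf)
aurc_bad = aurc(risks, bad_conf)
# Optimal reference: rank cases by their true risk (lowest risk first).
aurc_optimal = aurc(risks, -risks)
```

The optimal reference corresponds to the lower gray marks described in the captions for Figures 3 and 4. A random ranking can be approximated by averaging `aurc` over shuffled confidence scores.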
