Every Component Counts: Rethinking the Measure of Success for Medical Semantic Segmentation in Multi-Instance Segmentation Tasks

Alexander Jaus, Constantin Seibold, Simon Reiß, Zdravko Marinov, Keyi Li, Zeling Ye, Stefan Krieg, Jens Kleesiek, Rainer Stiefelhagen

TL;DR

This work proposes to evaluate existing semantic segmentation metrics on a per-component basis, thus giving each tumor the same weight irrespective of its size, and breaks free of the biases that large metastases introduce into overlap-based metrics such as Dice or Surface Dice.

Abstract

We present Connected-Component (CC)-Metrics, a novel semantic segmentation evaluation protocol that aligns existing semantic segmentation metrics with a multi-instance detection scenario in which each connected component matters. We motivate this setup with the common medical scenario of semantic metastasis segmentation in full-body PET/CT. We show that existing semantic segmentation metrics are biased towards larger connected components, which contradicts the clinical assessment of scans, in which tumor size and clinical relevance are uncorrelated. To rebalance existing segmentation metrics, we propose to evaluate them on a per-component basis, thus giving each tumor the same weight irrespective of its size. To match predictions to ground-truth segments, we employ a proximity-based matching criterion and evaluate common metrics locally at the component of interest. Using this approach, we break free of the biases that large metastases introduce into overlap-based metrics such as Dice or Surface Dice. CC-Metrics also improves distance-based metrics such as Hausdorff distances, which are uninformative for small changes that do not influence the maximum or 95th percentile, and avoids the pitfalls of directly combining counting-based metrics with overlap-based metrics, as is done in Panoptic Quality.
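
The per-component evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that proximity-based matching means assigning every pixel to its nearest ground-truth connected component (Euclidean distance) and that the final score is the unweighted mean of the per-region Dice values; the function names `dice` and `cc_dice`, the toy image, and the handling of empty ground truth are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from scipy import ndimage


def dice(pred, gt):
    """Standard Dice on two binary masks (defined as 1.0 if both are empty)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, gt).sum() / denom


def cc_dice(pred, gt):
    """Hypothetical per-component Dice: mean Dice over nearest-component regions."""
    labels, n = ndimage.label(gt)  # connected components of the ground truth
    if n == 0:
        # Assumption: with no ground-truth component, fall back to global Dice.
        return dice(pred, gt)
    # For every pixel, find the coordinates of the nearest labeled pixel.
    idx = ndimage.distance_transform_edt(
        labels == 0, return_distances=False, return_indices=True
    )
    regions = labels[tuple(idx)]  # region map: id of the nearest component
    scores = [dice(pred[regions == k], gt[regions == k]) for k in range(1, n + 1)]
    return float(np.mean(scores))


# Toy example: one large lesion (100 pixels) and one small lesion (4 pixels).
gt = np.zeros((10, 30), dtype=np.uint8)
gt[0:10, 0:10] = 1
gt[4:6, 25:27] = 1

# The prediction finds the large lesion perfectly and misses the small one.
pred = np.zeros_like(gt)
pred[0:10, 0:10] = 1

global_score = dice(pred, gt)   # 200 / 204, roughly 0.98
cc_score = cc_dice(pred, gt)    # (1.0 + 0.0) / 2 = 0.5
print(f"Dice: {global_score:.3f}, CC-Dice: {cc_score:.3f}")
```

In this toy case the global Dice stays near 98% although one of the two lesions is missed entirely, whereas the per-component average drops to 0.5 because each lesion contributes equally.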

Paper Structure

This paper contains 33 sections, 1 theorem, 14 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

The generalized Voronoi Diagram as stated in Def. 4 is a unique separation of a metric space.
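
Def. 4 is not reproduced in this summary. For orientation only, one common way to write a generalized Voronoi region for a family of sets is sketched below; it may differ from the paper's exact definition, and the symbols $X$, $d$, $C_k$, and $R_k$ are assumptions introduced here (a metric space $(X, d)$ and components $C_1, \dots, C_K \subseteq X$).

```latex
R_k = \{\, x \in X \mid d(x, C_k) \le d(x, C_j) \ \ \forall j \ne k \,\},
\qquad
d(x, C) = \inf_{c \in C} d(x, c)
```

Under this formulation, points equidistant from two components belong to more than one region unless a tie-breaking rule is added.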

Figures (6)

  • Figure 1: Reporting a Dice of 98% for the example shown greatly overestimates the capability of the trained semantic segmentation model. This might leave radiologists with a false impression of how reliably the model can predict tumors in body scans. With CC-Metrics, we partition an image into distinct regions and evaluate standard semantic segmentation metrics on a per-component basis, which gives each tumor the same importance.
  • Figure 2: Comparison of Dice, CC-Dice and Panoptic Quality: In the upper plot we start from a perfect prediction and degrade prediction quality by applying erosion to all components uniformly. In the middle plot we only degrade the prediction of the smallest mask. In the lower plot, we compare CC-Dice with Lesion Dice (LD) by using dilation to simulate oversegmentation (left) and highlight a pitfall of LD (right).
  • Figure 3: Comparison of the standard Hausdorff95 metric with the CC-Hausdorff95 metric (upper plot), as well as standard Surface Dice with CC-Surface Dice (lower plot). In both scenarios, we start from a perfect prediction and assess the metric scores while degrading the prediction quality of a large versus a small component.
  • Figure 4: Comparison of standard and CC semantic segmentation metrics on the AutoPET dataset across multiple scenarios
  • Figure 5: Comparison of standard and CC semantic segmentation metrics on the AutoPET dataset across various scenarios. In this setting, we want to evaluate CC-Metrics on a constant subset of patients. Thus, to be able to drop a maximum of $n=10$ components, we include only patients with at least $11$ metastases for all measurements in this scenario, including those in which we drop fewer than $10$ components. For all other scenarios, we do not reduce the number of components during the prediction degradation and thus consider patients with at least $10$ components.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Theorem 1