nnActive: A Framework for Evaluation of Active Learning in 3D Biomedical Segmentation

Carsten T. Lüth, Jeremias Traub, Kim-Celine Kahl, Till J. Bungert, Lukas Klein, Lars Krämer, Paul F. Jaeger, Fabian Isensee, Klaus Maier-Hein

TL;DR

nnActive delivers a rigorous, open-source framework to evaluate active learning for 3D biomedical segmentation across multiple datasets and budget regimes, addressing prior methodological pitfalls. It integrates partial-annotation training on 3D patches within an enhanced nnU-Net, introduces Foreground Aware Random baselines, and proposes the FG-Eff metric to better capture annotation effort. The large-scale study reveals that while AL methods outperform naive Random sampling, foreground-aware random baselines often challenge AL, with Predictive Entropy being a strong but variable performer. The work provides practical guidelines and a robust benchmark to catalyze further research and application of AL in 3D biomedical imaging.

Abstract

Semantic segmentation is crucial for various biomedical applications, yet its reliance on large annotated datasets presents a bottleneck due to the high cost and specialized expertise required for manual labeling. Active Learning (AL) aims to mitigate this challenge by querying only the most informative samples, thereby reducing annotation effort. However, in the domain of 3D biomedical imaging, there is no consensus on whether AL consistently outperforms Random sampling. Four evaluation pitfalls hinder the current methodological assessment: (1) restriction to too few datasets and annotation budgets, (2) using 2D models on 3D images without partial annotations, (3) a Random baseline that is not adapted to the task, and (4) measuring annotation cost only in voxels. In this work, we introduce nnActive, an open-source AL framework that overcomes these pitfalls by (1) conducting a large-scale study spanning four biomedical imaging datasets and three label regimes, (2) extending nnU-Net to train on partial annotations with 3D patch-based query selection, (3) proposing Foreground Aware Random sampling strategies that tackle the foreground-background class imbalance of medical images, and (4) proposing the foreground efficiency metric, which captures the low annotation cost of background regions. We reveal the following findings: (A) while all AL methods outperform standard Random sampling, none reliably surpasses an improved Foreground Aware Random sampling; (B) the benefits of AL depend on task-specific parameters; (C) Predictive Entropy is overall the best-performing AL method, but likely requires the most annotation effort; (D) AL performance can be improved with more compute-intensive design choices. As a holistic, open-source framework, nnActive can serve as a catalyst for research and application of AL in 3D biomedical imaging. Code is at: https://github.com/MIC-DKFZ/nnActive
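
To make the idea of patch-based querying with Predictive Entropy more concrete, the following is a minimal sketch and not the nnActive implementation. It computes the standard Shannon entropy of each voxel's class probabilities and ranks candidate patches by it. The function names, the toy data, and in particular the choice to aggregate voxel entropies by their mean are assumptions made for illustration only.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one voxel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def score_patch(patch_probs):
    """Score a patch by its mean voxel entropy.

    Assumption: the mean is used here only for illustration; the source
    does not say how voxel scores are aggregated per patch.
    """
    return sum(predictive_entropy(v) for v in patch_probs) / len(patch_probs)


def select_top_k(patches, k):
    """Return the ids of the k patches with the highest uncertainty score."""
    ranked = sorted(patches, key=lambda pid: score_patch(patches[pid]), reverse=True)
    return ranked[:k]


# Hypothetical toy data: each patch is a list of per-voxel softmax outputs
# for a two-class problem (background, foreground).
patches = {
    "confident": [[0.99, 0.01], [0.98, 0.02]],
    "uncertain": [[0.5, 0.5], [0.6, 0.4]],
    "mixed": [[0.9, 0.1], [0.5, 0.5]],
}

query = select_top_k(patches, k=2)
print(query)
```

In this toy setting the patch whose voxels sit closest to a uniform prediction is queried first, which is the greedy behaviour an entropy-based query method would be expected to show.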

Paper Structure

This paper contains 85 sections, 4 equations, 25 figures, 20 tables, 1 algorithm.

Figures (25)

  • Figure 1: Visualization of the four Pitfalls (P1-P4) alongside our solutions and how their presence hinders reliable performance assessments of AL methods for 3D biomedical imaging. For visualization purposes, we use 2D slices as partial annotations.
  • Figure 2: PPM aggregated over all experiments of the main study. At each position $(i, j)$ the values indicate the fraction of pairwise comparisons in % where method $i$ significantly outperformed method $j$.
  • Figure 3: A detailed view into the Win-/Lose-ratios of AL methods in the PPM (Figure 2) for the main study against Random (a) and Random 66% FG (b). All AL methods outperform Random substantially more often than they are outperformed, with Noisy QMs showing no Lose-scenarios (a). However, only Predictive Entropy outperforms Random 66% FG slightly more often than it is outperformed (b).
  • Figure 4: Ranking of methods according to AUBC, Final Dice and FG-Eff for each dataset and its Label Regimes (Low, Medium & High) alongside mean with standard deviations (bar). The benefit of AL over Foreground Aware Random strategies differs across datasets: on AMOS we observe no benefit from AL in any Label Regime, on KiTS and Hippocampus AL methods lead to performance improvements, and on ACDC the result is more neutral. Further, Noisy QMs tend to outperform their Greedy counterparts (e.g. PowerBALD vs. BALD) in the Low-Label Regime.
  • Figure 5: Does longer training improve AL performance? $\Delta \text{Final Dice} = (\text{Final Dice(500 Epochs)} - \text{Final Dice(Precomputed)}) \times 100$. Positive values indicate that longer training leads to better queries even when accounting for performance differences stemming from longer training. Dark colors indicate the significance of a two-sided t-test ($\alpha = 0.1$).
  • ...and 20 more figures
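
The PPM described in the Figure 2 caption can be illustrated with a short sketch: entry $(i, j)$ is the percentage of experiments in which method $i$ significantly outperformed method $j$. The code below assumes the significance decisions have already been made by some statistical test and only performs the aggregation. The function name `build_ppm`, the method labels and the toy outcomes are hypothetical and are not results from the paper.

```python
def build_ppm(methods, experiments):
    """Aggregate significant pairwise wins into a percentage matrix.

    methods:     list of method names.
    experiments: list with one entry per experiment; each entry is a set of
                 (winner, loser) pairs already judged significant by an
                 external test (assumed given, not implemented here).
    Returns a nested dict where ppm[i][j] is the percentage of experiments
    in which method i significantly outperformed method j.
    """
    counts = {i: {j: 0 for j in methods if j != i} for i in methods}
    for wins in experiments:
        for winner, loser in wins:
            counts[winner][loser] += 1
    n = len(experiments)
    return {i: {j: 100.0 * c / n for j, c in row.items()} for i, row in counts.items()}


# Hypothetical outcomes for three placeholder methods over four experiments.
methods = ["method_a", "method_b", "method_c"]
experiments = [
    {("method_b", "method_a"), ("method_c", "method_a")},
    {("method_b", "method_a")},
    {("method_c", "method_a"), ("method_c", "method_b")},
    {("method_b", "method_c")},
]

ppm = build_ppm(methods, experiments)
for name, row in ppm.items():
    print(name, row)
```

Reading a row shows how often a method wins, while reading a column shows how often it loses, which is the Win-/Lose view taken in the Figure 3 caption.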