OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection

Max Gutbrod, David Rauber, Danilo Weber Nunes, Christoph Palm

TL;DR

OpenMIBOOD introduces three medical-imaging OOD benchmarks (MIDOG, PHAKIR, OASIS3) comprising 14 datasets to systematically evaluate post-hoc OOD detectors under csID, nOOD, and fOOD conditions. Using an OpenOOD-inspired framework, the study finds that methods trained on natural images do not generalize well to medical data, with feature-space approaches such as MDSEns and ViM outperforming probability-based methods in most cases. The authors provide standardized dataset splits, metrics (AUROC, FPR@95, AUPRIN/AUPROUT, harmonic mean), and a public codebase, revealing dataset-specific challenges that limit OOD detection in healthcare. The work emphasizes the necessity of domain-specific benchmarks for trustworthy AI in medicine and outlines avenues for extending to segmentation tasks and broader evaluation beyond classification.
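As a rough illustration of two of the metrics named above, the following sketch computes AUROC and FPR@95 from per-sample OOD scores. It assumes that higher scores mean "more in-distribution" and that ID samples are the positive class. The function names `auroc` and `fpr_at_tpr` are hypothetical and are not taken from the OpenMIBOOD codebase, whose implementation may differ.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores higher than a random
    OOD sample; ties count as half a win (assumed convention)."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples still accepted as ID at the threshold
    that keeps at least `tpr` of the ID samples."""
    ranked = sorted(id_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))  # number of ID samples that must be accepted
    threshold = ranked[k - 1]         # lowest score among those k ID samples
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


# Toy scores, invented purely for illustration.
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.3, 0.2, 0.1]
print(auroc(id_scores, ood_scores))       # 0.9375
print(fpr_at_tpr(id_scores, ood_scores))  # 0.25
```

The pairwise AUROC loop is quadratic in the number of samples, which is fine for a toy example; a rank-based implementation would be preferable on full benchmark splits.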

Abstract

The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at https://github.com/remic-othr/OpenMIBOOD.
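To make the term "post-hoc" concrete: a post-hoc detector derives an OOD score from an already trained classifier without retraining it. The sketch below uses the maximum softmax probability, a common probability-based baseline, purely as an example. The names `msp_score` and `is_ood` and the threshold value are assumptions for illustration; this section does not specify how any of the 24 evaluated methods are configured.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits):
    """Maximum softmax probability; higher means 'more in-distribution'."""
    return max(softmax(logits))


def is_ood(logits, threshold):
    """Flag a sample as OOD when its confidence falls below `threshold`
    (the threshold is a hypothetical value, typically tuned on held-out data)."""
    return msp_score(logits) < threshold


# Toy logits, invented for illustration.
print(is_ood([10.0, 0.0, 0.0], threshold=0.5))   # confident prediction
print(is_ood([0.1, 0.0, 0.05], threshold=0.5))   # near-uniform prediction
```

Feature-space methods such as those highlighted in the TL;DR instead score samples from intermediate representations rather than from the output probabilities.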

Paper Structure

This paper contains 51 sections, 1 equation, 9 figures, 31 tables.

Figures (9)

  • Figure 1: Illustration of all utilized datasets, categorized by varying degrees of domain shift, from csID to nOOD and fOOD, presented from left to right: MIDOG: All distinct MIDOG domains [aubreville2023comprehensive], CCAgT [amorim2020novel, atkinson_amorim_ccagt_2022], FNAC 2019 [saikia2019comparative]. PHAKIR: PHAKIR frames with smoke [rueckert2024miccai], cholec [twinanda2016endonet], EndoSeg15 [bodenstedt2018comparative], EndoSeg18 [allan20202018], KVASIR [jha2020kvasir, pogorelov2017kvasir], CATARACTS [al2019cataracts]. OASIS3: T2w modality and distinct scanner from OASIS3 [lamontagne2019oasis], ATLAS [liew2022large], BRATS [menze2014multimodal], CT from OASIS3 [lamontagne2019oasis], MSD-H [tobon2015benchmark], CHAOS [CHAOSdata2019]. On top of each group the primary domain shift is presented. A ‘+’ indicates that the previous domain shift is also included.
  • Figure 2: Distribution of OOD scores for the top four methods on challenging datasets from the MIDOG and PHAKIR benchmarks, including AUROC values for each dataset and method.
  • Figure 3: Average ranking of each method based on AUROC across the MIB (y-axis) and on the IN1k benchmark (x-axis).
  • Figure 4: Distribution of OOD scores for the top four methods on two nOOD datasets from the OASIS3 benchmark, including AUROC values for each dataset and method.
  • Figure 5: Image from the csID Medium Smoke dataset. Classification attribution is visualized in turquoise using Integrated Gradients (Sundararajan et al. [sundararajan2017axiomatic]), revealing the PHAKIR classifier's tendency to base decisions on regions containing instruments. Arrows indicate an area with localized smoke.
  • ...and 4 more figures