Evaluating Reliability in Medical DNNs: A Critical Analysis of Feature and Confidence-Based OOD Detection

Harry Anthony, Konstantinos Kamnitsas

TL;DR

The authors argue that feature-based and confidence-based OOD detection methods should be combined within DNN pipelines to mitigate their respective weaknesses.

Abstract

Reliable use of deep neural networks (DNNs) for medical image analysis requires methods to identify inputs that differ significantly from the training data, called out-of-distribution (OOD), to prevent erroneous predictions. OOD detection methods can be categorised as either confidence-based (using the model's output layer for OOD detection) or feature-based (not using the output layer). We created two new OOD benchmarks by dividing the D7P (dermatology) and BreastMNIST (ultrasound) datasets into subsets which either contain or don't contain an artefact (rulers or annotations respectively). Models were trained with artefact-free images, and images with the artefacts were used as OOD test sets. For each OOD image, we created a counterfactual by manually removing the artefact via image processing, to assess the artefact's impact on the model's predictions. We show that OOD artefacts can boost a model's softmax confidence in its predictions, due to correlations in training data among other factors. This contradicts the common assumption that OOD artefacts should lead to more uncertain outputs, an assumption on which most confidence-based methods rely. We use this to explain why feature-based methods (e.g. Mahalanobis score) typically have greater OOD detection performance than confidence-based methods (e.g. MCP). However, we also show that feature-based methods typically perform worse at distinguishing between inputs that lead to correct and incorrect predictions (for both OOD and ID data). Following from these insights, we argue that a combination of feature-based and confidence-based methods should be used within DNN pipelines to mitigate their respective weaknesses. This project's code and OOD benchmarks are available at: https://github.com/HarryAnthony/Evaluating_OOD_detection.
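
To make the two categories named in the abstract concrete, the following is a minimal NumPy sketch of one score of each kind: MCP (confidence-based, computed from the output layer) and a Mahalanobis score (feature-based, computed from hidden features). It is not the authors' implementation. The function names, the use of a single feature layer, and the shared (tied) covariance across classes are assumptions made for illustration. Both scores are oriented so that higher values mean "more ID-like".

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mcp_score(logits):
    """Confidence-based score: maximum class probability per input."""
    return softmax(logits).max(axis=1)

def fit_mahalanobis(train_feats, train_labels):
    """Estimate class means and a shared precision matrix from ID training features.

    Assumption: one covariance matrix tied across classes.
    """
    classes = np.unique(train_labels)
    means = np.stack([train_feats[train_labels == c].mean(axis=0) for c in classes])
    # Centre every training feature on the mean of its own class.
    centered = train_feats - means[np.searchsorted(classes, train_labels)]
    cov = centered.T @ centered / len(train_feats)
    precision = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    return means, precision

def mahalanobis_score(feats, means, precision):
    """Feature-based score: negative squared distance to the closest class mean."""
    diffs = feats[:, None, :] - means[None, :, :]               # (n, classes, d)
    d2 = np.einsum("ncd,de,nce->nc", diffs, precision, diffs)   # squared distances
    return -d2.min(axis=1)
```

Note that `mcp_score` only sees the logits, so an artefact that pushes the logits towards one class raises the score, which is the failure mode the abstract describes. `mahalanobis_score` never looks at the logits, so it carries no direct information about whether the predicted class is correct.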

Paper Structure (5 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Workflow for categorising data based on the model prediction impact, using intra-image interpolation to create synthetic images without artefacts.
  • Figure 2: Comparison of the VGG16 model's output and XAI heatmap (LRP) for D7P images with and without artefacts, showing cases where predictions are correct (a) or incorrect (b) only with the artefact. These cases demonstrate that OOD artefacts can lead to high-confidence predictions.
  • Figure 3: a) BreastMNIST and b) D7P test sets were analysed using different OOD methods, removing predictions from a VGG16 model below $\lambda_{75-ID}$. The pie charts illustrate the distribution of predictions (see Fig. 1), while the bar charts display the percentage of ID and OOD data remaining after removing predictions below $\lambda_{75-ID}$, compared to the original dataset (i). The figure shows MCP's limitation in removing OOD data (ii) and Mahalanobis score's tendency to reduce prediction accuracy (iii). Combining these methods (iv) yields the most trustworthy predictions, but with a higher dismissal rate.
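
The filtering described in the Figure 3 caption can be sketched as below. The captions here do not define $\lambda_{75-ID}$; this sketch assumes it is the score threshold at which 75% of ID data is retained, and it assumes that "combining" means keeping a prediction only when both scores pass their own thresholds. Both readings, and the function names, are illustrative assumptions rather than the authors' stated procedure.

```python
import numpy as np

def id_retention_threshold(id_scores, keep_fraction=0.75):
    """Threshold such that roughly `keep_fraction` of ID scores lie at or above it.

    Assumed reading of lambda_{75-ID} when keep_fraction=0.75.
    """
    return np.quantile(id_scores, 1.0 - keep_fraction)

def combined_keep_mask(mcp, maha, mcp_thr, maha_thr):
    """Keep a prediction only if BOTH scores are at or above their thresholds."""
    return (mcp >= mcp_thr) & (maha >= maha_thr)
```

Because the combined mask is the intersection of the two individual masks, it can never keep more predictions than either method alone, which is consistent with the higher dismissal rate noted in the caption.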