Deep Unsupervised Anomaly Detection in Brain Imaging: Large-Scale Benchmarking and Bias Analysis

Alexander Frotscher, Christian F. Baumgartner, Thomas Wolfers

TL;DR

This work tackles unsupervised anomaly detection (UAD) in brain MRI by delivering a large-scale, multi-center benchmark that spans multiple scanners, diseases, and demographics. It evaluates eight state-of-the-art UAD methods (reconstruction- and feature-based) on T1w and T2w data, emphasizing threshold estimation via a diverse validation set and robust post-processing. Key findings show that reconstruction-based, diffusion-inspired approaches excel for large lesions but are sensitive to domain shifts and to biases related to age, sex, and scanner; feature-based methods are more robust to distribution shifts but underperform on small, subtle lesions. The study argues that data quantity alone is insufficient and highlights a path toward clinical translation through principled deviation metrics, MRI-native pretraining, fairness-aware modeling, and robust domain adaptation.

Abstract

Deep unsupervised anomaly detection in brain magnetic resonance imaging offers a promising route to identify pathological deviations without requiring lesion-specific annotations. Yet, fragmented evaluations, heterogeneous datasets, and inconsistent metrics have hindered progress toward clinical translation. Here, we present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging. The training cohort comprised 2,976 T1-weighted and 2,972 T2-weighted scans from healthy individuals across six scanners, with ages ranging from 6 to 89 years. Validation used 92 scans to tune hyperparameters and estimate unbiased thresholds. Testing encompassed 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts. Across all algorithms, the Dice-based segmentation performance varied between 0.03 and 0.65, indicating substantial variability. To assess robustness, we systematically evaluated the impact of different scanners, lesion types and sizes, as well as demographics (age, sex). Reconstruction-based methods, particularly diffusion-inspired approaches, achieved the strongest lesion segmentation performance, while feature-based methods showed greater robustness under distributional shifts. However, systematic biases were observed for the majority of algorithms: performance depended on the scanner, small and low-contrast lesions were missed more often, and false positives varied with age and sex. Increasing healthy training data yielded only modest gains, underscoring that current unsupervised anomaly detection frameworks are limited algorithmically rather than by data availability. Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation, including image-native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
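
The idea of estimating a threshold on validation data and then reporting Dice on an untouched test set can be illustrated with a minimal sketch. The function names `dice_score` and `estimate_threshold`, the grid of candidate thresholds, the toy arrays, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration only; the benchmark's actual threshold search and post-processing are not described in this section.

```python
import numpy as np


def dice_score(pred, target):
    """Dice overlap between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def estimate_threshold(anomaly_maps, masks, candidates):
    """Pick the candidate threshold with the highest mean Dice on validation data."""
    best_tau, best_dice = None, -1.0
    for tau in candidates:
        scores = [dice_score(a > tau, m) for a, m in zip(anomaly_maps, masks)]
        mean_dice = float(np.mean(scores))
        if mean_dice > best_dice:
            best_tau, best_dice = float(tau), mean_dice
    return best_tau, best_dice


# Hypothetical toy data: one validation map and one test map (flattened voxels).
val_maps = [np.array([0.1, 0.2, 0.8, 0.9])]
val_masks = [np.array([0, 0, 1, 1])]
test_map = np.array([0.3, 0.7, 0.6, 0.2])
test_mask = np.array([0, 1, 0, 0])

# The threshold is fixed on validation data and never re-tuned on the test set.
tau, val_dice = estimate_threshold(val_maps, val_masks, [0.05, 0.5, 0.85])
test_dice = dice_score(test_map > tau, test_mask)
print(f"tau={tau}, validation Dice={val_dice:.2f}, test Dice={test_dice:.2f}")
```

Tuning `tau` directly on `test_map` and `test_mask` would instead give the optimistic, test-set-optimized score, which is why the two numbers can differ.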

Paper Structure

This paper contains 30 sections, 10 figures, 18 tables.

Figures (10)

  • Figure 1: Large-scale benchmarking of interdisciplinary state-of-the-art deep unsupervised anomaly detection. a) Number of volumes in the training and test datasets, split by condition, compared to previous benchmarks. b) Age distribution of the training data. c) Example of a reconstruction-based approach. An encoder and decoder are trained exclusively on healthy data, and during testing anomalies are identified by thresholding the reconstruction error with $\bm{\tau}$. d) Example of a feature-based approach. A feature map is created by passing the image through a neural network. These features are either (i) reconstructed using an autoencoder, analogous to panel c, or (ii) compared directly to features from a memory bank of healthy images. Anomalies are flagged when the Euclidean distance to healthy features exceeds $\bm{\tau}$.
  • Figure 2: Multi-site and multi-task evaluation of state-of-the-art unsupervised anomaly detection. a) The performance of the algorithms on T1w images was reported using two different thresholds. The first is the optimal threshold, defined as the threshold that yields the maximum possible Dice score when optimized on the test set; it is standard in the field but potentially susceptible to bias. The second is the estimated threshold, which is optimized on the validation dataset, then fixed and applied to the untouched test set; it is unbiased but not standard in the field. For the large lesions, we did not observe a substantial difference between the thresholding procedures. b) The performance of the algorithms with the optimal and estimated thresholds on structural T2-weighted images. c-d) Example images and labels for each dataset and modality. Taken together, we show heterogeneous performance across lesions and modalities; no single method was superior across tasks. Across all evaluations, the best-performing methods were Disyre followed by ANDi.
  • Figure 3: False positive rate on healthy brains for all tested algorithms at the estimated threshold. For some methods, we identified high false positive rates and heterogeneous performance across the healthy cohort for T1w images (for T2w, see the supplement). This shows that the methods were biased with respect to either imaging protocols or demographics, and that selecting the right decision threshold is critical.
  • Figure 4: Out-of-distribution effects due to scanner differences and influence of lesion load. The ATLAS dataset was split by the percentiles of the lesion load: "top" corresponds to a lesion load above the 75th percentile, "upper" to above the 50th and up to the 75th percentile, "middle" to above the 25th and up to the 50th percentile, and "lower" to below the 25th percentile. Columns marked with * correspond to distributions of Dice scores that differ according to the Mann-Whitney U test after Benjamini–Hochberg correction for multiple testing at significance level $\alpha=0.05$. Taken together, reconstruction-based approaches are more sensitive to imaging protocols, while feature-based methods are generally robust to this source of variation in the data.
  • Figure 5: Impact of demographics on false positive rates and lesion identification. a) The false positive rate on the HCP Aging dataset for the female and male groups. All groups show significant differences based on the Mann-Whitney U test. A significance level of $\alpha=0.05$ was used for all tests. b) The false positive rate on the HCP Aging dataset adjusted by age. The Spearman rank correlation test showed a significant positive correlation for all algorithms, indicating that older individuals are more likely to have a normal voxel flagged as anomalous. c) Cohen’s d on the HCP Aging dataset for the male and female groups in both modalities. All algorithms show a positive Cohen’s d, reflecting higher false positive rates for males. d) The Dice score on the BraTS dataset adjusted by age. Only UniAD (marked with *) shows a significant negative correlation (Spearman rank correlation). Bias for age and sex on healthy volumes generally decreased with increasing performance of the algorithm.
  • ...and 5 more figures
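
Two of the statistics named in the captions above, the Benjamini–Hochberg adjustment (Figure 4) and Cohen's d (Figure 5), can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names, the p-values, and the false positive rates are hypothetical, the p-values are taken as given rather than computed with a Mann-Whitney U test, and Cohen's d is computed with a pooled standard deviation as male minus female so that a positive value matches the direction described in Figure 5.

```python
from math import sqrt
from statistics import mean, variance


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def cohens_d(group_a, group_b):
    """Cohen's d (group_a minus group_b) using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical p-values, e.g. one test per algorithm or dataset column.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adjusted_p]
print(adjusted_p, significant)

# Hypothetical per-subject false positive rates (in percent) for two groups.
fpr_male = [2.0, 4.0, 6.0]
fpr_female = [1.0, 3.0, 5.0]
print(cohens_d(fpr_male, fpr_female))
```

In this toy input, two raw p-values below 0.05 no longer pass after adjustment, which shows why the correction matters when many algorithm and dataset combinations are tested at once.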