The Invisible Gorilla Effect in Out-of-distribution Detection

Harry Anthony, Ziyun Liang, Hermione Warr, Konstantinos Kamnitsas

TL;DR

This work identifies the Invisible Gorilla Effect in out-of-distribution detection, showing that OOD artefacts visually similar to a model's region of interest are more readily detected than dissimilar ones, particularly for near-OOD cases. It performs a large-scale evaluation of 40 OOD detectors across 7 benchmarks and 3 architectures, using colour-annotated artefacts and colour-swapped counterfactuals to rule out dataset bias, and derives a mechanistic explanation via a PCA-based nuisance subspace that aligns colour variation with high-variance latent directions. The study finds that feature-based OOD methods are more susceptible to this bias than confidence-based ones, and demonstrates that projecting features orthogonally to the nuisance subspace can substantially mitigate the effect with modest latency overhead. Together, the results provide principled guidance for designing more robust OOD detectors and offer practical mitigation workarounds and publicly available annotations to advance robust deployment in healthcare and industrial settings.
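The mitigation mentioned above, projecting features orthogonally to a nuisance subspace, can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the nuisance directions are estimated as the top-k principal components of feature differences between paired images (for example an original and its colour-swapped counterfactual). The function names, the choice of `k`, and the synthetic features are all hypothetical.

```python
import numpy as np

def fit_nuisance_basis(feats_a, feats_b, k):
    """Top-k principal directions of paired feature differences (assumed recipe)."""
    diffs = feats_a - feats_b                          # (n, d): variation attributed to the nuisance factor
    diffs = diffs - diffs.mean(axis=0, keepdims=True)  # centre before PCA
    # Rows of vt are orthonormal principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                                      # (k, d)

def project_out(feats, basis):
    """Remove the component of each feature vector that lies in span(basis)."""
    return feats - (feats @ basis.T) @ basis

# Synthetic stand-in for latent features: shared content plus a scalar
# amount of variation along one hypothetical "colour" direction.
rng = np.random.default_rng(0)
n, d = 200, 8
colour_dir = rng.normal(size=d)
colour_dir /= np.linalg.norm(colour_dir)
content = rng.normal(size=(n, d))
feats_a = content + rng.normal(size=(n, 1)) * colour_dir
feats_b = content + rng.normal(size=(n, 1)) * colour_dir

basis = fit_nuisance_basis(feats_a, feats_b, k=1)
clean_a = project_out(feats_a, basis)
clean_b = project_out(feats_b, basis)
```

In this sketch the projection costs two small matrix products per batch, and any feature-based OOD score would then be computed on the cleaned features rather than the raw ones.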

Abstract

Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model's ROI and drops when it does not - a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show the method Mahalanobis Score achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations are available at: https://github.com/HarryAnthony/Invisible_Gorilla_Effect.
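To make the AUROC comparison in the abstract concrete, the following is a minimal NumPy sketch of a Mahalanobis-style OOD score and its AUROC. It makes simplifying assumptions that are not taken from the paper: a single Gaussian is fitted to all training features (the commonly used variant fits class-conditional means with a shared covariance), the ridge term `1e-6` is arbitrary, and the features are synthetic. All function and variable names are hypothetical.

```python
import numpy as np

def fit_gaussian(train_feats):
    """Mean and precision matrix of in-distribution (ID) training features."""
    mean = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    # Small ridge term (assumed value) keeps the covariance invertible.
    prec = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return mean, prec

def mahalanobis_score(feats, mean, prec):
    """Squared Mahalanobis distance to the ID Gaussian; higher means more OOD."""
    delta = feats - mean
    return np.einsum("nd,de,ne->n", delta, prec, delta)

def auroc(ood_scores, id_scores):
    """P(random OOD score > random ID score), with ties counted as one half."""
    ood = np.asarray(ood_scores, dtype=float)[:, None]
    idd = np.asarray(id_scores, dtype=float)[None, :]
    return float(((ood > idd) + 0.5 * (ood == idd)).mean())

# Synthetic features standing in for penultimate-layer activations.
rng = np.random.default_rng(0)
id_train = rng.normal(size=(500, 4))
id_test = rng.normal(size=(200, 4))
ood_small_shift = rng.normal(loc=1.0, size=(200, 4))  # harder to detect
ood_large_shift = rng.normal(loc=4.0, size=(200, 4))  # easier to detect

mean, prec = fit_gaussian(id_train)
id_scores = mahalanobis_score(id_test, mean, prec)
auroc_small = auroc(mahalanobis_score(ood_small_shift, mean, prec), id_scores)
auroc_large = auroc(mahalanobis_score(ood_large_shift, mean, prec), id_scores)
auroc_gap = auroc_large - auroc_small
```

The quantity of interest in the paper is a gap of this kind, measured between artefacts similar and dissimilar to the ROI rather than between the two synthetic groups used here.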

Paper Structure (24 sections, 3 equations, 9 figures, 29 tables)

Figures (9)

  • Figure 1: Invisible Gorilla Effect. (Left) A DNN (e.g. a ResNet-18 skin lesion classifier) is trained for a primary task, where the region of interest (ROI) is the skin lesion, with a mean RGB of $(176,116,77)$ across the dataset. Once trained, the model is deployed on held-out data, including OOD data containing an unseen ink artefact. (Centre) Visualisation of the OOD detection methods analysed in this study, categorised as Internal Post-hoc, Internal Ad-hoc and External. Circle area reflects the number of method hyperparameters evaluated. (Right) We evaluated OOD detection AUROC on artefacts with colours similar (red) and dissimilar (green, purple, black) to the model’s ROI. A statistically significant AUROC drop ($p<10^{-5}$, Wilcoxon signed-rank) is observed for Mahalanobis Score and RealNVP methods on dissimilar-colour artefacts - an instance of the Invisible Gorilla Effect. Error bars denote 95% confidence intervals over 25 random seeds.
  • Figure 2: (Left) Visualisation of annotated images from MVTec and ISIC, where red borders indicate artefacts visually similar to the model’s region of interest (ROI), and blue borders indicate dissimilar artefacts. (Right) Examples of colour-swapped counterfactuals for colour charts for both dissimilar and similar artefacts, showing the original image, artefact mask and the resulting counterfactual image.
  • Figure 3: Panel (a) shows training data with hyperintense hearts; panel (b) shows counterfactuals with hypointense hearts. Top rows show training examples and middle rows show OOD samples with synthetic squares of varying intensities. Bottom plots display AUROC for Mahalanobis Score across square intensities, with each point averaged over 25 runs on ResNet-18 primary models and connected by a cubic spline. Training on hyperintense hearts improves OOD detection performance for hyperintense artefacts, and conversely for hypointense hearts.
  • Figure 4: a) Classification accuracy drop between ID and OOD datasets versus Mahalanobis OOD AUROC for each colour (marker colour = artefact colour) for the ISIC ink benchmark, averaged over 25 ResNet-18 models, showing a weak correlation ($r=0.39$). b) OOD AUROC drop between similar and dissimilar colours versus OOD AUROC on similar colours across 40 methods (averaged across ISIC benchmarks), showing a weak positive correlation.
  • Figure 5:
  • ...and 4 more figures
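
The colour-swapped counterfactuals in Figure 2 are built from an original image and an artefact mask. One simple way such a swap could be done is sketched below, as an assumption rather than the authors' procedure: masked pixels are shifted so their mean RGB matches a target colour while per-pixel deviations (texture) are preserved. The function name `colour_swap`, the toy image and the target colour are hypothetical; the background value reuses the mean ROI RGB of (176, 116, 77) quoted in Figure 1.

```python
import numpy as np

def colour_swap(image, mask, target_rgb):
    """Recolour masked pixels so their mean RGB equals target_rgb.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean array.
    Pixels outside the mask are returned unchanged.
    """
    out = image.astype(np.float64)          # astype returns a copy
    region = out[mask]                      # (n_pixels, 3)
    if region.size == 0:
        return image.copy()
    target = np.asarray(target_rgb, dtype=np.float64)
    # Shift the mean colour, keep each pixel's deviation from that mean.
    out[mask] = region - region.mean(axis=0) + target
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# Toy 4x4 image: skin-toned background with a 2x2 reddish ink patch.
image = np.empty((4, 4, 3), dtype=np.uint8)
image[:] = (176, 116, 77)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
image[mask] = [[200, 30, 30], [210, 40, 40], [190, 20, 20], [200, 30, 30]]

counterfactual = colour_swap(image, mask, target_rgb=(30, 30, 30))
```

Values that fall outside the 0 to 255 range after the shift are clipped, so the target mean is only met exactly when no clipping occurs, as in this toy case.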