Thinking in Groups: Permutation Tests Reveal Near-Out-of-Distribution

Yasith Jayawardana, Dineth Jayakody, Sampath Jayarathna, Dushan N. Wadduwage

TL;DR

This work tackles near-OoD detection in biomedical AI by exploiting within-specimen replication to form homogeneous groups. The authors define homogeneous-OoD (HOoD) and formulate OoD detection as a two-sample exchangeability test against each of $K$ reference subpopulations, using permutation-based multi-response permutation procedure (MRPP) statistics on latent responses $Z(x;\phi)$. The method outputs one $p$-value per subpopulation and declares a group in-distribution (InD) if $\max_k p_k \ge \alpha$, enabling interpretable, batch-wise OoD assessment without strong distributional assumptions. Empirical results on toy MNIST/CIFAR-10 splits and the AMRB bacteria dataset show that MRPP/LSP-based HOoD outperforms standard point-wise detectors and offers robust near-OoD detection across architectures and datasets, with practical potential for real-world biomedical deployment.
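
For orientation, the test and decision rule summarized above can be written out as follows. This is a sketch based on the generic MRPP formulation from the permutation-testing literature, not a transcription of the paper's equations: the distance $\Delta$, the group weights $C_g$, the number of permutations $B$, and the "+1" correction in the $p$-value are assumptions introduced here for illustration, and the paper's exact choices may differ.

$$
\delta = \sum_{g} C_g\,\xi_g, \qquad \xi_g = \binom{n_g}{2}^{-1} \sum_{i<j,\; i,j \in g} \Delta\big(Z(x_i;\phi),\, Z(x_j;\phi)\big), \qquad C_g = \frac{n_g}{\sum_{h} n_h}
$$

$$
p_k = \frac{1 + \#\{\, b \in \{1,\dots,B\} : \delta_k^{(b)} \le \delta_k \,\}}{1 + B}, \qquad D(\alpha) = \begin{cases} \text{InD} & \text{if } \max_k p_k \ge \alpha, \\ \text{OoD} & \text{otherwise.} \end{cases}
$$

Here $g$ ranges over the two groups being compared (the test sample and the $k$-th reference sample), $n_g$ is the size of group $g$, $\delta_k$ is the statistic observed for reference $k$, and $\delta_k^{(b)}$ is the statistic recomputed after the $b$-th random reassignment of group labels. A small $\delta_k$ relative to its permutation distribution means the two groups are more tightly clustered than chance would allow, so a small $p_k$ is evidence against exchangeability with subpopulation $k$.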

Abstract

Deep neural networks (DNNs) have the potential to power many biomedical workflows, but training them on truly representative, IID datasets is often infeasible. Most models instead rely on biased or incomplete data, making them prone to out-of-distribution (OoD) inputs that closely resemble in-distribution samples. Such near-OoD cases are harder to detect than standard OoD benchmarks and can cause unreliable, even catastrophic, predictions. Biomedical assays, however, offer a unique opportunity: they often generate multiple correlated measurements per specimen through biological or technical replicates. Exploiting this insight, we introduce Homogeneous OoD (HOoD), a novel OoD detection framework for correlated data. HOoD projects groups of correlated measurements through a trained model and uses permutation-based hypothesis tests to compare them with known subpopulations. Each test yields an interpretable p-value, quantifying how well a group matches a subpopulation. By aggregating these p-values, HOoD reliably identifies OoD groups. In evaluations, HOoD consistently outperforms point-wise and ensemble-based OoD detectors, demonstrating its promise for robust real-world deployment.
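
The pipeline described in the abstract (project a group through a model, run a permutation test against each known subpopulation, aggregate the p-values) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `mrpp_delta`, `mrpp_pvalue`, and `hood_decision` are hypothetical, Euclidean distance and size-proportional group weights are assumed, and synthetic Gaussian vectors stand in for the latent responses of a trained model.

```python
import numpy as np

def mrpp_delta(dist, labels):
    """MRPP statistic: size-weighted mean of within-group average pairwise distances."""
    n = len(labels)
    delta = 0.0
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        sub = dist[np.ix_(idx, idx)]
        # Average over unordered pairs (i < j) inside group g.
        pairs = sub[np.triu_indices(len(idx), k=1)]
        delta += (len(idx) / n) * pairs.mean()
    return delta

def mrpp_pvalue(test_z, ref_z, n_perm=1000, rng=None):
    """Two-sample permutation test; returns (observed delta, p-value)."""
    rng = np.random.default_rng(rng)
    z = np.vstack([test_z, ref_z])
    # Pairwise Euclidean distances (assumed metric), computed once and reused.
    dist = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1))
    labels = np.r_[np.zeros(len(test_z), int), np.ones(len(ref_z), int)]
    observed = mrpp_delta(dist, labels)
    # Count label shuffles whose delta is at least as small as the observed one.
    count = sum(
        mrpp_delta(dist, rng.permutation(labels)) <= observed + 1e-12
        for _ in range(n_perm)
    )
    return observed, (count + 1) / (n_perm + 1)

def hood_decision(test_z, references, alpha=0.05, n_perm=1000, rng=None):
    """Declare InD if the group is exchangeable with at least one reference."""
    rng = np.random.default_rng(rng)
    p = np.array([mrpp_pvalue(test_z, ref, n_perm, rng)[1] for ref in references])
    return ("InD" if p.max() >= alpha else "OoD"), p

# Synthetic stand-ins for latent responses of two known subpopulations
# and one test group drawn far away from both.
rng = np.random.default_rng(0)
references = [rng.normal(0.0, 1.0, (20, 4)), rng.normal(5.0, 1.0, (20, 4))]
test_group = rng.normal(20.0, 1.0, (20, 4))
decision, p_values = hood_decision(test_group, references, n_perm=200, rng=1)
print(decision, p_values)
```

Taking the maximum p-value means a group is rejected only when it fails to match every reference subpopulation. In this sketch the distance matrix is computed once per comparison, so each permutation costs only a relabeling and a few indexed means.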

Paper Structure

This paper contains 25 sections, 16 equations, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: Method overview. A dataset with training and validation data from classes $\{y_1,\cdots,y_k\}$ (with $k \ll K$) is used to train a model. Next, for each class $y_k$, a homogeneous sample $S[y_k]$ of $N$ data points is drawn from the validation data and passed through the model to obtain latent responses. At inference time, a new homogeneous sample of $M$ data points is passed through the model to obtain latent responses, which are compared with those of each $S[y_k]$ using hypothesis tests. Each test yields a statistic $\delta_k$ and a significance value $p_k$, which are then collected into vectors ($\boldsymbol{\delta}$, $\mathbf{p}$). The distributions generated for the $k$ comparisons provide insight into the homogeneity and exchangeability of the test sample. Finally, $\mathbf{p}$ is mapped to InD/OoD using a decision function $D(\alpha)$ at significance level $\alpha$.
  • Figure 2: MRPP statistic and $p$-value from $5\times5$ species-level HOoD tests (left) and $21\times18$ strain-level HOoD tests (right), performed using logit outputs from two ResNet-50 models (AMRB-A, AMRB-B), each trained on a disjoint subset of labels from the AMRB dataset. Each cell represents a HOoD test (permutations = 3000, sample size = 100) between a test sample (row label) and a reference sample (column label). High $p$-values are expected for HOoD tests on the diagonal (same species/strain).
  • Figure 3: UMAP - Feature Space and Logit Space of ResNet-50 model for AMRB Data. Higher InD-OoD separation is better. Data points are class-wise discriminative when the model is trained on all classes. However, this discriminative property is lost when the model is trained on a subset of classes. In this case, OoD classes fall into the same regions as InD classes, making point-wise OoD detection challenging on both spaces.
  • Figure 4: UMAP - Feature Space and Logit Space of ResNet-CAE model for AMRB Data. Higher InD-OoD separation is better.
  • Figure 5: Uncertainty of test data. Higher InD-OoD separation is better. Blue: InD-Correct; Orange: InD-Incorrect; Green: OoD. C1-C3: $(1-\text{MSP})$ for ResNet-50; C4-C6: $(1-\text{MSP})$ for ResNet-CAE, where C1-C6 denote columns from left to right. The uncertainty of OoD data points is spread across the full range of the metric, making point-wise OoD detection from uncertainty challenging.
  • ...and 7 more figures

Theorems & Definitions (1)

  • Definition 1: Homogeneity