A Median Perspective on Unlabeled Data for Out-of-Distribution Detection

Momin Abbas, Ali Falahati, Hossein Goli, Mohammad Mohammadi Amiri

TL;DR

Medix presents a median-centric, two-stage framework for out-of-distribution detection that leverages unlabeled in-the-wild data to identify candidate OOD outliers via a gradient-based, element-wise median filter. After extracting these candidates, Medix trains a dedicated OOD detector using InD data plus the filtered outliers, guided by a surrogate loss that preserves InD performance. Theoretical analysis provides two-sided error bounds under a sub-Gaussian gradient assumption and a relaxed non-sub-Gaussian bound, highlighting contamination, concentration, and separation effects that govern robustness. Empirically, Medix achieves superior OOD detection performance across CIFAR-10/100 with wild data, outperforming 20 baselines and even performing well in large-scale unseen OOD settings, while maintaining practical computational efficiency. These results demonstrate the practical viability of median-based filtering for robust open-world OOD detection with unlabeled data.

Abstract

Out-of-distribution (OOD) detection plays a crucial role in ensuring the robustness and reliability of machine learning systems deployed in real-world applications. Recent approaches have explored the use of unlabeled data, showing potential for enhancing OOD detection capabilities. However, effectively utilizing unlabeled in-the-wild data remains challenging because such data mixes in-distribution (InD) and OOD samples. The lack of a distinct set of OOD samples complicates the task of training an optimal OOD classifier. In this work, we introduce Medix, a novel framework designed to identify potential outliers from unlabeled data using the median operation. We build the detection mechanism on the median because its robustness to noise and outliers makes it a stable estimate of central tendency. Using these identified outliers, along with labeled InD data, we train a robust OOD classifier. From a theoretical perspective, we derive error bounds that demonstrate Medix achieves a low error rate. Empirical results further substantiate our claims, as Medix outperforms existing methods across the board in open-world settings, confirming the validity of our theoretical insights.
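The filtering stage summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes per-sample gradient vectors are already available, scores each sample by its Euclidean distance to the element-wise median, and uses a fixed threshold. The distance score, the threshold value, the synthetic data, and the helper names `elementwise_median` and `median_filter` are all assumptions made for illustration.

```python
import math
import random
import statistics


def elementwise_median(vectors):
    """Coordinate-wise median of a list of equal-length vectors."""
    return [statistics.median(column) for column in zip(*vectors)]


def median_filter(grads, threshold):
    """Flag vectors that lie far from the element-wise median.

    The Euclidean distance score and the fixed threshold are
    assumptions of this sketch, not details taken from the paper.
    """
    center = elementwise_median(grads)
    scores = [math.dist(g, center) for g in grads]
    flags = [score > threshold for score in scores]
    return scores, flags


# Synthetic stand-ins for per-sample gradients (hypothetical data):
# 90 "InD-like" vectors around 0 and 10 "OOD-like" vectors around 8.
rng = random.Random(0)
inliers = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(90)]
outliers = [[rng.gauss(8.0, 1.0) for _ in range(5)] for _ in range(10)]
wild = inliers + outliers

scores, flags = median_filter(wild, threshold=8.0)
candidate_outliers = [g for g, flagged in zip(wild, flags) if flagged]
```

Because the majority of the mixture is InD-like, the element-wise median stays close to the inlier center even with 10% contamination, so the far-away cluster receives much larger scores. In the framework described above, the flagged candidates would then be combined with labeled InD data to train the OOD classifier.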

Paper Structure

This paper contains 32 sections, 5 theorems, 103 equations, 4 figures, 7 tables, 1 algorithm.

Key Result

Theorem 4.1

Assume that the gradients of InD points in $\mathcal{S}_{\mathrm{wild}}$ are i.i.d., and each coordinate is sub-Gaussian with variance proxy $\sigma^2$. Let $\epsilon = \sigma \sqrt{2\log(2 d m_{\mathrm{in}})}$, and fix any confidence level $\delta \in (0,1)$. Then, with probability at least $1 - \d
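To get a feel for the scale of $\epsilon = \sigma \sqrt{2\log(2 d m_{\mathrm{in}})}$ in the statement above, the following sketch evaluates it for a few values. The inputs for $\sigma$, $d$, and $m_{\mathrm{in}}$ are hypothetical and chosen only to show that $\epsilon$ grows logarithmically in the dimension and in the number of InD points.

```python
import math


def epsilon(sigma, d, m_in):
    """Evaluate epsilon = sigma * sqrt(2 * log(2 * d * m_in))."""
    return sigma * math.sqrt(2.0 * math.log(2 * d * m_in))


# Hypothetical values, for illustration only.
eps_small = epsilon(sigma=1.0, d=10, m_in=1_000)      # about 4.45
eps_large = epsilon(sigma=1.0, d=10, m_in=1_000_000)  # about 5.80
```

Increasing $m_{\mathrm{in}}$ by a factor of 1000 raises $\epsilon$ by only about 30% in this example, while $\epsilon$ scales linearly with $\sigma$.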

Figures (4)

  • Figure 1: Distance deviation as we increase OOD samples in $\mathcal{S}_{\rm wild}$.
  • Figure 2: Example of Medix applied to unlabeled wild data. (a) Setup of the InD data ${\cal S}_{\text{wild}}^{\text{in}}$ and OOD data ${\cal S}_{\text{wild}}^{\text{out}}$ in the wild, with inliers sampled from three multivariate Gaussian distributions. (b) Outliers $\hat{\mathcal{S}}_{\rm out}$ filtered by Medix (in black); the error rate, measured as InD data from ${\cal S}_{\text{wild}}^{\text{in}}$ contained in $\hat{\mathcal{S}}_{\rm out}$, is only 12.5%.
  • Figure 3: Comparison of element-wise median (EWM) and geometric median (GM).
  • Figure 4: Illustration of InD sample gradients exhibiting sub-Gaussian behavior in each coordinate. (left) Histogram of gradient values (CIFAR-100 InD data) showing concentration around the mean with light tails, consistent with sub-Gaussianity. (right) Q-Q plot comparing empirical quantiles of InD gradients against a theoretical Gaussian distribution, confirming alignment with sub-Gaussianity.
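For readers unfamiliar with the two estimators compared in Figure 3, the sketch below computes both on a toy point set. The element-wise median takes the median of each coordinate independently, while the geometric median minimizes the sum of Euclidean distances to the points. The geometric median is approximated here with Weiszfeld's iteration, which is a standard choice and an assumption of this sketch; the source does not say how the paper computes it. The point set and function names are hypothetical.

```python
import math
import statistics


def elementwise_median(points):
    """Median of each coordinate, computed independently."""
    return [statistics.median(column) for column in zip(*points)]


def total_distance(points, y):
    """Sum of Euclidean distances from y to every point."""
    return sum(math.dist(p, y) for p in points)


def geometric_median(points, iters=200, tol=1e-9):
    """Approximate the geometric median with Weiszfeld's iteration."""
    y = [statistics.fmean(column) for column in zip(*points)]  # start at the mean
    for _ in range(iters):
        num = [0.0] * len(y)
        den = 0.0
        for p in points:
            dist = math.dist(p, y)
            if dist < tol:
                continue  # simple guard: skip points that coincide with y
            weight = 1.0 / dist
            den += weight
            for j, value in enumerate(p):
                num[j] += weight * value
        if den == 0.0:
            break
        new_y = [n / den for n in num]
        converged = math.dist(new_y, y) < tol
        y = new_y
        if converged:
            break
    return y


# Toy data (hypothetical): a right triangle with its corner at the origin.
points = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
ewm = elementwise_median(points)
gm = geometric_median(points)
```

On this toy set the two estimates differ: the element-wise median lands on the corner at the origin, while the geometric median lies strictly inside the triangle on the line $y = x$ and attains a smaller total distance. The element-wise version needs only a per-coordinate sort, whereas the geometric median requires an iterative solver.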

Theorems & Definitions (9)

  • Theorem 4.1: Inlier Misclassification Bound
  • Theorem 4.2: Outlier Misclassification Bound
  • Remark 4.3
  • Theorem C.1: Inlier Misclassification Bound
  • Proof
  • Theorem C.2: Outlier Retention Bound under Vector Separation
  • Proof
  • Theorem C.3: Inlier Misclassification Bound without Sub-Gaussianity
  • Proof