An Upper Bound for the Distribution Overlap Index and Its Applications

Hao Fu, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami

TL;DR

The paper tackles measuring distribution similarity without modeling assumptions by introducing a distribution-free upper bound for the overlap index $\eta(P,Q)$, computable from finite samples. The bound comprises a mean-difference term $\|\mu_{D^+}-\mu_{D^-}\|$ and a subset variation term $\delta_A$, with a tight form optimized over subsets $A$ and amenable to Monte Carlo estimation. It then demonstrates two key applications: a training-free one-class classifier that outputs a confidence score via $f(x)=ComputeBound(...)$, and a domain-shift analysis theorem that upper-bounds model accuracy under distribution changes. Empirical results across novelty detection, out-of-distribution detection, backdoor detection, and anomaly detection show strong, data-efficient performance and highlight practical advantages in computation and memory usage, supporting broader use of overlap-based metrics.

Abstract

This paper proposes an easy-to-compute upper bound for the overlap index between two probability distributions without requiring any knowledge of the distribution models. The computation of our bound is time-efficient and memory-efficient and only requires finite samples. The proposed bound shows its value in one-class classification and domain shift analysis. Specifically, in one-class classification, we build a novel one-class classifier by converting the bound into a confidence score function. Unlike most one-class classifiers, the training process is not needed for our classifier. Additionally, the experimental results show that our classifier can be accurate with only a small number of in-class samples and outperform many state-of-the-art methods on various datasets in different one-class classification scenarios. In domain shift analysis, we propose a theorem based on our bound. The theorem is useful in detecting the existence of domain shift and inferring data information. The detection and inference processes are both computation-efficient and memory-efficient. Our work shows significant promise toward broadening the applications of overlap-based metrics.
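To illustrate how a bound of this kind can be computed from finite samples alone, here is a minimal sketch of a coarse, mean-only bound. It is not the paper's algorithm: it uses only the elementary inequality $\eta \le 1 - \|\mu_{D^+}-\mu_{D^-}\| / (2r)$, where $r$ bounds the norm of every point in the domain. The function name `mean_difference_bound` is hypothetical, and estimating $r$ by the largest sample norm is an assumption, so the result holds for the empirical distributions of the samples rather than for the underlying distributions.

```python
import numpy as np


def mean_difference_bound(samples_p, samples_q):
    """Coarse upper bound on the overlap index of two empirical distributions.

    Both inputs are arrays of shape (n_samples, n_features).
    Assumption: the domain radius r is estimated by the largest L2 norm
    among all samples (hypothetical choice, not taken from the paper).
    """
    p = np.asarray(samples_p, dtype=float)
    q = np.asarray(samples_q, dtype=float)

    # Radius of the smallest origin-centered ball containing every sample.
    r = max(np.linalg.norm(p, axis=1).max(), np.linalg.norm(q, axis=1).max())
    if r == 0.0:
        # All samples sit at the origin, so the two empirical distributions coincide.
        return 1.0

    # Distance between the two sample means.
    gap = np.linalg.norm(p.mean(axis=0) - q.mean(axis=0))

    # eta <= 1 - ||mu_p - mu_q|| / (2 r)
    return float(1.0 - gap / (2.0 * r))


# Usage: identical samples give the trivial bound 1.0; shifted samples give less.
rng = np.random.default_rng(0)
same = rng.uniform(-1.0, 1.0, size=(500, 2))
shifted = rng.uniform(-1.0, 1.0, size=(500, 2)) + np.array([3.0, 0.0])
print(mean_difference_bound(same, same))
print(mean_difference_bound(same, shifted))
```

The computation needs only two sample means and one maximum norm, so its cost is linear in the number of samples and no model of either distribution is fitted.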

Paper Structure

This paper contains 22 sections, 3 theorems, 14 equations, 6 figures, 7 tables, 2 algorithms.

Key Result

Theorem 1

Without loss of generality, assume $D^+$ and $D^-$ are two probability distributions on a bounded domain $B \subset \mathbb{R}^n$ with defined norm $\|\cdot\|$ (in this paper, we use the $L_2$ norm; however, the choice of the norm is not unique and the analysis can be carried out using other norms as well) [...] where $r_A = \sup_{x\in A} \|x\|$ and $r_{A^c} = \sup_{x \in A^c}\|x\|$, $\mu_{D^+}$ and $\mu_{D^-}$ [...]
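The role of the mean-difference term can be motivated by an elementary argument. The following is a sketch of a coarser, mean-only bound, not the paper's Theorem 1 (which additionally involves a subset $A$ and the subset variation term $\delta_A$). It assumes, for illustration, that $D^+$ and $D^-$ have densities $p$ and $q$ on $B$, that $r = \sup_{x \in B}\|x\|$, and that $\delta$ denotes the total variation distance, with $\eta = 1 - \delta$.

```latex
\begin{aligned}
\|\mu_{D^+} - \mu_{D^-}\|
  &= \left\| \int_B x \,\bigl(p(x) - q(x)\bigr)\, dx \right\| \\
  &\le \int_B \|x\| \,\bigl|p(x) - q(x)\bigr|\, dx \\
  &\le r \int_B \bigl|p(x) - q(x)\bigr|\, dx
   = 2r\,\delta(D^+, D^-), \\[4pt]
\eta(D^+, D^-)
  &= 1 - \delta(D^+, D^-)
   \le 1 - \frac{\|\mu_{D^+} - \mu_{D^-}\|}{2r}.
\end{aligned}
```

A large gap between the two means, relative to the size of the domain, therefore forces the overlap to be small.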

Figures (6)

  • Figure 1: (a): Overlap of two distributions. (b): One-class classification. (c): Backdoor attack.
  • Figure 2: Evaluation on 100 small UCI datasets for novelty detection. Details are in Table \ref{tab:novelty}.
  • Figure 3: Performance of our approach with different $k$ when CIFAR-10 is the in-distribution data.
  • Figure 4: Pictures under "Triggers" are poisoned samples for different backdoor attacks. Pictures under "Clean" are clean samples for each dataset.
  • Figure 5: The actual model accuracy (dots) vs. the bound in Eq. (\ref{eq:shift}) (solid lines), calculated with the $L_1$, $L_2$, and $L_\infty$ norms in the input, output, and hidden spaces. x-axis: the ratio of clean samples to all testing samples.
  • ...and 1 more figure

Theorems & Definitions (15)

  • Definition 1: Overlap Index
  • Definition 2: Total Variation Distance
  • Definition 3: Variation Distance on Subsets
  • Remark 1
  • Theorem 1
  • proof
  • Remark 2
  • Corollary 1
  • Remark 3
  • Remark 4
  • ...and 5 more
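For reference, the first two definitions listed above are commonly written as follows. This is a sketch in standard notation, assuming distributions $P$ and $Q$ with densities $p$ and $q$ on a common domain $B$; the paper's exact notation may differ.

```latex
\begin{aligned}
\eta(P, Q)   &= \int_B \min\bigl(p(x), q(x)\bigr)\, dx
  && \text{(overlap index)} \\
\delta(P, Q) &= \frac{1}{2} \int_B \bigl|p(x) - q(x)\bigr|\, dx
  && \text{(total variation distance)} \\
\eta(P, Q)   &= \int_B \frac{p(x) + q(x) - |p(x) - q(x)|}{2}\, dx
  = 1 - \delta(P, Q)
\end{aligned}
```

The last line uses $\min(a, b) = \tfrac{1}{2}\bigl(a + b - |a - b|\bigr)$ together with the fact that each density integrates to 1, so any lower bound on the total variation distance yields an upper bound on the overlap index.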