An Upper Bound for the Distribution Overlap Index and Its Applications

Hao Fu; Prashanth Krishnamurthy; Siddharth Garg; Farshad Khorrami

TL;DR

The paper tackles measuring distribution similarity without modeling assumptions by introducing a distribution-free upper bound for the overlap index $\eta(P,Q)$, computable from finite samples. The bound comprises a mean-difference term $\|\mu_{D^+}-\mu_{D^-}\|$ and a subset variation term $\delta_A$, with a tight form optimized over subsets $A$ and amenable to Monte Carlo estimation. It then demonstrates two key applications: a training-free one-class classifier that outputs a confidence score via $f(x)=ComputeBound(...)$, and a domain-shift analysis theorem that upper-bounds model accuracy under distribution changes. Empirical results across novelty detection, out-of-distribution detection, backdoor detection, and anomaly detection show strong, data-efficient performance and highlight practical advantages in computation and memory usage, supporting broader use of overlap-based metrics.

Abstract

This paper proposes an easy-to-compute upper bound for the overlap index between two probability distributions without requiring any knowledge of the distribution models. The computation of our bound is time-efficient and memory-efficient and only requires finite samples. The proposed bound shows its value in one-class classification and domain shift analysis. Specifically, in one-class classification, we build a novel one-class classifier by converting the bound into a confidence score function. Unlike most one-class classifiers, the training process is not needed for our classifier. Additionally, the experimental results show that our classifier can be accurate with only a small number of in-class samples and outperform many state-of-the-art methods on various datasets in different one-class classification scenarios. In domain shift analysis, we propose a theorem based on our bound. The theorem is useful in detecting the existence of domain shift and inferring data information. The detection and inference processes are both computation-efficient and memory-efficient. Our work shows significant promise toward broadening the applications of overlap-based metrics.
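As a rough illustration of how a bound of this kind can be computed from finite samples, the sketch below evaluates only a mean-difference term. It is not the paper's ComputeBound procedure and omits the subset variation term $\delta_A$. It relies on the standard inequality $\eta \le 1 - \|\mu_{D^+}-\mu_{D^-}\| / (2r)$, which holds whenever every point of the domain has norm at most $r$. The function name `mean_difference_bound`, the `radius` parameter, and the choice to estimate $r$ from the largest sample norm are assumptions made for this example; estimating $r$ from samples makes the result an approximation rather than a guaranteed bound.

```python
import numpy as np

def mean_difference_bound(x_pos, x_neg, radius=None):
    """Simplified sample-based upper bound on the overlap index.

    Uses only the mean-difference term: 1 - ||mu+ - mu-|| / (2 r),
    where r bounds the L2 norm of every point in the domain.
    """
    x_pos = np.asarray(x_pos, dtype=float)
    x_neg = np.asarray(x_neg, dtype=float)
    if radius is None:
        # Assumption: approximate r by the largest L2 norm among all samples.
        radius = max(np.linalg.norm(x_pos, axis=1).max(),
                     np.linalg.norm(x_neg, axis=1).max())
    if radius == 0:
        # All samples sit at the origin, so the means coincide.
        return 1.0
    gap = np.linalg.norm(x_pos.mean(axis=0) - x_neg.mean(axis=0))
    # Clip in case a user-supplied radius is smaller than the data extent.
    return float(np.clip(1.0 - gap / (2.0 * radius), 0.0, 1.0))

# Hypothetical usage with synthetic data.
rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, size=(200, 2))
b = a + np.array([1.5, 0.0])         # same shape, shifted mean
print(mean_difference_bound(a, a))   # identical samples: 1.0 (uninformative)
print(mean_difference_bound(a, b))   # shifted samples: strictly below 1
```

Only two sample means and one maximum norm are needed, so the cost is linear in the number of samples and no pairwise distances or density estimates are stored.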

Paper Structure (22 sections, 3 theorems, 14 equations, 6 figures, 7 tables, 2 algorithms)

Introduction
Background and Related Works
An Upper Bound for the Overlap Index
Preliminaries
The Upper Bound for the Overlap Index
Approximating the Bound with Finite Samples
Application of Our Bound to One-Class Classification
Problem Formulation for One-Class Classification
A Novel Confidence Score Function
Computation and Space Complexities
Evaluation
Novelty Detection
Out-of-Distribution Detection
Backdoor Detection
Anomaly Detection with Iterative Scores
...and 7 more sections

Key Result

Theorem 1

Without loss of generality, assume $D^+$ and $D^-$ are two probability distributions on a bounded domain $B \subset \mathbb{R}^n$ with defined norm $\|\cdot\|$ ... where $r_A = \sup_{x\in A} \|x\|$ and $r_{A^c} = \sup_{x \in A^c}\|x\|$, $\mu_{D^+}$ and $\mu_{D^-}$ ...

Footnote: In this paper, we use the $L_2$ norm. However, the choice of the norm is not unique and the analysis can be carried out using other norms as well.
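To see why a mean-difference term can bound the overlap index at all, a short derivation may help. It is a simplified sketch covering only the mean-difference part, not the paper's Theorem 1, which also involves the subset term $\delta_A$. It assumes that $D^+$ and $D^-$ have densities $p$ and $q$ on $B$, uses the usual definitions of the overlap index and the total variation distance $\delta$, and writes $r_B = \sup_{x \in B}\|x\|$, a symbol introduced here by analogy with $r_A$.

$$
\begin{aligned}
\eta(D^+, D^-) &= \int_B \min\{p(x), q(x)\}\,dx = 1 - \tfrac{1}{2}\int_B |p(x) - q(x)|\,dx = 1 - \delta(D^+, D^-), \\
\|\mu_{D^+} - \mu_{D^-}\| &= \Big\| \int_B x\,\big(p(x) - q(x)\big)\,dx \Big\| \le \int_B \|x\|\,|p(x) - q(x)|\,dx \le 2\,r_B\,\delta(D^+, D^-), \\
\Longrightarrow\quad \eta(D^+, D^-) &\le 1 - \frac{1}{2 r_B}\,\|\mu_{D^+} - \mu_{D^-}\|.
\end{aligned}
$$

The first line uses $\min\{p, q\} = \tfrac{1}{2}\,(p + q - |p - q|)$ together with the fact that both densities integrate to one. The second line uses the triangle inequality for integrals and $\|x\| \le r_B$ on $B$.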

Figures (6)

Figure 1: (a): Overlap of two distributions. (b): One-class classification. (c): Backdoor attack.
Figure 2: Evaluation on 100 small UCI datasets for novelty detection. Details are in Table \ref{tab:novelty}.
Figure 3: Performance of our approach with different $k$ when CIFAR-10 is the in-distribution data.
Figure 4: Pictures under "Triggers" are poisoned samples for different backdoor attacks. Pictures under "Clean" are clean samples for each dataset.
Figure 5: The actual model accuracy (dot) vs. (\ref{eq:shift}) (solid) calculated with $L_1$, $L_2$, and $L_\infty$ norms in input, output, and hidden spaces. x-axis: the ratio of clean samples to all testing samples.
...and 1 more figures

Theorems & Definitions (15)

Definition 1: Overlap Index
Definition 2: Total Variation Distance
Definition 3: Variation Distance on Subsets
Remark 1
Theorem 1
Proof
Remark 2
Corollary 1
Remark 3
Remark 4
...and 5 more
