An Upper Bound for the Distribution Overlap Index and Its Applications
Hao Fu, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami
TL;DR
The paper addresses the problem of measuring distribution similarity without modeling assumptions by introducing a distribution-free upper bound for the overlap index $\eta(P,Q)$ that can be computed from finite samples. The bound combines a mean-difference term $\|\mu_{D^+}-\mu_{D^-}\|$ with a subset variation term $\delta_A$; a tight form is obtained by optimizing over subsets $A$, and the bound is amenable to Monte Carlo estimation. The paper then demonstrates two applications: a training-free one-class classifier that outputs a confidence score via $f(x)=\mathrm{ComputeBound}(...)$, and a domain-shift analysis theorem that upper-bounds model accuracy under distribution changes. Empirical results across novelty detection, out-of-distribution detection, backdoor detection, and anomaly detection show strong, data-efficient performance and low computation and memory cost, supporting broader use of overlap-based metrics.
Abstract
This paper proposes an easy-to-compute upper bound for the overlap index between two probability distributions that requires no knowledge of the underlying distribution models. Computing the bound is efficient in both time and memory and requires only finite samples. We demonstrate the value of the bound in one-class classification and domain shift analysis. In one-class classification, we build a novel one-class classifier by converting the bound into a confidence score function. Unlike most one-class classifiers, ours requires no training. The experimental results show that our classifier can be accurate with only a small number of in-class samples and that it outperforms many state-of-the-art methods on various datasets in different one-class classification scenarios. In domain shift analysis, we propose a theorem based on our bound. The theorem is useful for detecting the existence of domain shift and for inferring information about the data. Both the detection and the inference processes are efficient in computation and memory. Our work shows significant promise toward broadening the applications of overlap-based metrics.
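
To make the bounded quantity concrete, the sketch below assumes the standard definition of the overlap index, $\eta(P,Q)=\int \min\{p(x),q(x)\}\,dx$, which the text above does not state explicitly. It evaluates that integral numerically for one-dimensional Gaussians whose densities are known. It is not the paper's bound or its `ComputeBound` procedure, and the function names `gaussian_pdf` and `overlap_index_1d` are hypothetical, introduced only for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def overlap_index_1d(pdf_p, pdf_q, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of min(p, q) over [lo, hi].

    Assumes both densities are negligible outside [lo, hi].
    """
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += min(pdf_p(x), pdf_q(x))
    return total * width


def standard(x):
    return gaussian_pdf(x, 0.0, 1.0)


def shifted_by_two(x):
    return gaussian_pdf(x, 2.0, 1.0)


def shifted_by_twenty(x):
    return gaussian_pdf(x, 20.0, 1.0)


# Identical distributions overlap completely; well-separated ones barely at all.
identical = overlap_index_1d(standard, standard, -10.0, 10.0)
shifted = overlap_index_1d(standard, shifted_by_two, -10.0, 12.0)
far = overlap_index_1d(standard, shifted_by_twenty, -10.0, 30.0)

print(f"identical: {identical:.4f}")
print(f"shifted by 2 std: {shifted:.4f}")
print(f"shifted by 20 std: {far:.2e}")
```

Under this definition the index lies in $[0,1]$, equals 1 for identical distributions, and equals one minus the total variation distance. Evaluating it directly, as above, requires the densities $p$ and $q$; the bound proposed in the paper is aimed at the setting where only finite samples are available.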
