Single-sample versus case-control sampling scheme for Positive Unlabeled data: the story of two scenarios
Jan Mielniczuk, Adam Wawrzeńczyk
TL;DR
The paper examines how positive-unlabeled (PU) learning methods based on empirical risk minimization (ERM) can fail when applied under a data-labeling scheme other than the one they were designed for. By deriving risk representations under the SCAR (selected completely at random) assumption for both the single-sample (s-s) and case-control (c-c) settings, the authors show that the two risk decompositions differ, so a method tailored to one scenario is valid on data from the other only when rare equalities hold. They propose a scenario-aware PU method, nnPU_{ss}, and compare it with nnPU_{cc}, giving explicit empirical risk formulations for each setting. Experiments on 18 datasets show that mis-specified methods can overfit and perform poorly, especially at high label frequencies $c$. The results argue for aligning the learning objective with the data collection scheme and offer practical, code-backed guidance for scenario-aware PU inference, with implications for PU-based ranking and evaluation in real-world applications. The central point is that the risk definition must reflect how the unlabeled data are generated, which yields more robust and interpretable PU classifiers across data collection contexts.
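To make the difference between the two risk decompositions concrete, here is a sketch derived from the SCAR assumption alone; it is not quoted from the paper. The notation is an assumption introduced for illustration: $\pi = P(Y=1)$ is the class prior, $S$ the labeling indicator with $P(S=1 \mid Y=1) = c$, $g$ the scoring function, and $\ell$ a surrogate loss.

```latex
% Case-control: the unlabeled sample is drawn from the marginal P_X.
R(g) = \pi\,\mathbb{E}_{X \mid Y=1}\,\ell(g(X),+1)
     + \Big[\,\mathbb{E}_{X}\,\ell(g(X),-1)
     - \pi\,\mathbb{E}_{X \mid Y=1}\,\ell(g(X),-1)\Big]

% Single-sample: the unlabeled sample is drawn from P_{X \mid S=0}, with P(S=1) = c\pi.
% Under SCAR, P_{X \mid S=1} = P_{X \mid Y=1} and
(1 - c\pi)\,P_{X \mid S=0} = (1-\pi)\,P_{X \mid Y=0} + (1-c)\,\pi\,P_{X \mid Y=1},
% which gives
R(g) = \pi\,\mathbb{E}_{X \mid S=1}\,\ell(g(X),+1)
     + \Big[(1 - c\pi)\,\mathbb{E}_{X \mid S=0}\,\ell(g(X),-1)
     - (1-c)\,\pi\,\mathbb{E}_{X \mid S=1}\,\ell(g(X),-1)\Big]
```

In both cases the non-negative variant clips the bracketed term at zero. The weights on the unlabeled and labeled terms differ by factors that depend on $c$, which is consistent with the discrepancy being most visible at high label frequency.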
Abstract
We argue that the performance of classifiers based on Empirical Risk Minimization (ERM) for positive unlabeled data that are designed for the case-control sampling scheme may deteriorate significantly when they are applied in a single-sample scenario. We show why their behavior depends on the scenario in all but very specific cases. We also introduce a single-sample analogue of the popular non-negative risk classifier designed for case-control data and compare its performance with that of the original proposal. Significant differences occur between the two, especially when half or more of the positive observations are labeled. The opposite case, in which an ERM minimizer designed for single-sample data is applied to case-control data, is also considered, and similar conclusions are drawn. Accounting for the difference between the scenarios requires a single but crucial change in the definition of the empirical risk.
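As an illustration of how small the change to the empirical risk can be, the following minimal sketch implements a non-negative PU risk for each scenario with NumPy. It is an assumption-laden illustration rather than the authors' implementation: the function names, the sigmoid surrogate loss, and the arguments `prior` (class prior $P(Y=1)$) and `c` (label frequency) are hypothetical, and the single-sample weights follow from SCAR by a standard mixture argument.

```python
import numpy as np


def sigmoid_loss(scores, label):
    """Sigmoid surrogate loss l(g, y) = 1 / (1 + exp(y * g))."""
    return 1.0 / (1.0 + np.exp(label * scores))


def nnpu_risk_cc(g_pos, g_unl, prior):
    """Non-negative PU risk, case-control: unlabeled scores come from P(X)."""
    pos_term = prior * sigmoid_loss(g_pos, +1).mean()
    # Estimated risk on negatives; clipped at zero to limit overfitting.
    neg_term = sigmoid_loss(g_unl, -1).mean() - prior * sigmoid_loss(g_pos, -1).mean()
    return pos_term + max(0.0, neg_term)


def nnpu_risk_ss(g_pos, g_unl, prior, c):
    """Non-negative PU risk, single-sample: unlabeled scores come from P(X | S=0)."""
    p_unl = 1.0 - c * prior  # P(S=0) under SCAR
    pos_term = prior * sigmoid_loss(g_pos, +1).mean()
    # Only the two weights differ from the case-control version.
    neg_term = (p_unl * sigmoid_loss(g_unl, -1).mean()
                - (1.0 - c) * prior * sigmoid_loss(g_pos, -1).mean())
    return pos_term + max(0.0, neg_term)


# Usage with synthetic scores (illustrative values only).
rng = np.random.default_rng(0)
g_pos = rng.normal(loc=1.0, size=200)   # scores of labeled positives
g_unl = rng.normal(loc=-0.5, size=800)  # scores of unlabeled points
risk_cc = nnpu_risk_cc(g_pos, g_unl, prior=0.4)
risk_ss = nnpu_risk_ss(g_pos, g_unl, prior=0.4, c=0.7)
```

In this sketch the two functions coincide in the limit `c = 0` and diverge as `c` grows, so applying the case-control weights to single-sample data misweights both terms of the clipped difference.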
