
Single-sample versus case-control sampling scheme for Positive Unlabeled data: the story of two scenarios

Jan Mielniczuk, Adam Wawrzeńczyk

TL;DR

The paper addresses how positive-unlabeled (PU) learning methods, especially those based on empirical risk minimization (ERM), can fail when applied across different data-labeling schemes. By deriving risk representations under the SCAR assumption for both single-sample (s-s) and case-control (c-c) settings, the authors show that the risk decompositions differ and that using a method tailored to one scenario on data from the other is valid only in very specific cases. They propose a scenario-aware PU method, $\mathrm{nnPU}_{ss}$, and compare it to $\mathrm{nnPU}_{cc}$, providing explicit empirical risk formulations for each setting and demonstrating through experiments on 18 datasets that mis-specified methods can overfit and perform poorly, especially at high label frequencies $c$. The results advocate aligning the learning objective with the data collection scheme: the risk definition must reflect how the unlabeled data are generated.
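
To make the difference between the two labeling schemes concrete, the following is a minimal simulation sketch. The Gaussian toy model, the function names, and the parameter values are assumptions made for illustration and do not come from the paper. Under SCAR, the unlabeled part of a single-sample dataset contains positives with proportion $\pi(1-c)/(1-c\pi)$, where $\pi = P(Y=1)$, whereas in the case-control scheme the unlabeled sample is drawn from $P_X$ and contains positives with proportion $\pi$.

```python
import numpy as np

def sample_population(n, pi, rng):
    """Draw (x, y) with y ~ Bernoulli(pi) and x | y ~ N(2y, 1) (toy model, an assumption)."""
    y = (rng.random(n) < pi).astype(int)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

def single_sample(n, pi, c, rng):
    """Single-sample scheme: one draw from the population; each positive
    is labeled (s = 1) independently with probability c (SCAR)."""
    x, y = sample_population(n, pi, rng)
    s = ((y == 1) & (rng.random(n) < c)).astype(int)
    return x, y, s

def case_control(n_labeled, n_unlabeled, pi, rng):
    """Case-control scheme: labeled sample from P(X | Y=1) and an
    independent unlabeled sample from the marginal P(X)."""
    x_labeled = rng.normal(loc=2.0, scale=1.0, size=n_labeled)
    x_unlabeled, y_unlabeled = sample_population(n_unlabeled, pi, rng)
    return x_labeled, x_unlabeled, y_unlabeled

rng = np.random.default_rng(0)
pi, c = 0.5, 0.9

# Fraction of true positives hidden among the unlabeled observations.
x, y, s = single_sample(200_000, pi, c, rng)
ss_frac = y[s == 0].mean()

_, _, y_u = case_control(1_000, 200_000, pi, rng)
cc_frac = y_u.mean()

expected_ss = pi * (1 - c) / (1 - c * pi)
print(f"s-s unlabeled positive fraction: {ss_frac:.3f} (theory {expected_ss:.3f})")
print(f"c-c unlabeled positive fraction: {cc_frac:.3f} (theory {pi:.3f})")
```

The gap between the two fractions grows with $c$, which is consistent with the summary's remark that mis-specified methods suffer most at high label frequencies.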

Abstract

In this paper we argue that the performance of classifiers based on Empirical Risk Minimization (ERM) for positive unlabeled data that are designed for the case-control sampling scheme may deteriorate significantly when applied to a single-sample scenario. We reveal why their behavior depends, in all but very specific cases, on the scenario. We also introduce a single-sample analogue of the popular non-negative risk classifier designed for case-control data and compare its performance with the original proposal. We show that significant differences occur between them, especially when half or more of the positive observations are labeled. The opposite case, in which an ERM minimizer designed for the single-sample case is applied to case-control data, is also considered and similar conclusions are drawn. Taking the difference between scenarios into account requires a single, but crucial, change in the definition of the Empirical Risk.
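
The change in the empirical risk mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the sigmoid loss, the toy data, and the fixed scoring function `g` are all hypothetical, and the paper's notation may differ. The case-control form is the usual non-negative risk $\pi \hat R_p^+ + \max(0, \hat R_u^- - \pi \hat R_p^-)$. The single-sample form follows from $P_X = c\pi P_{X|S=1} + (1 - c\pi) P_{X|S=0}$ under SCAR, which changes the weights of the negative part to $(1 - c\pi)$ and $(1 - c)\pi$.

```python
import numpy as np

def sigmoid_loss(scores, target):
    """Sigmoid loss l(g, t) = 1 / (1 + exp(t * g)) for t in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(target * scores))

def nn_risk_cc(g_labeled, g_unlabeled, pi):
    """Non-negative risk assuming unlabeled scores come from P(X) (case-control)."""
    pos = pi * sigmoid_loss(g_labeled, +1).mean()
    neg = (sigmoid_loss(g_unlabeled, -1).mean()
           - pi * sigmoid_loss(g_labeled, -1).mean())
    return pos + max(0.0, neg)

def nn_risk_ss(g_labeled, g_unlabeled, pi, c):
    """Single-sample analogue assuming unlabeled scores come from P(X | S=0)."""
    pos = pi * sigmoid_loss(g_labeled, +1).mean()
    neg = ((1 - c * pi) * sigmoid_loss(g_unlabeled, -1).mean()
           - (1 - c) * pi * sigmoid_loss(g_labeled, -1).mean())
    return pos + max(0.0, neg)

# Toy single-sample data (an assumption): x | y ~ N(2y, 1), SCAR labeling.
rng = np.random.default_rng(0)
n, pi, c = 200_000, 0.5, 0.9
y = (rng.random(n) < pi).astype(int)
x = rng.normal(loc=2.0 * y, scale=1.0)
s = ((y == 1) & (rng.random(n) < c)).astype(int)

g = x - 1.0  # fixed, hypothetical scoring function
true_risk = sigmoid_loss(g, 2 * y - 1).mean()  # uses the hidden labels
risk_ss = nn_risk_ss(g[s == 1], g[s == 0], pi, c)
risk_cc = nn_risk_cc(g[s == 1], g[s == 0], pi)

print(f"true risk: {true_risk:.3f}")
print(f"s-s risk estimate on s-s data: {risk_ss:.3f}")
print(f"c-c risk estimate on s-s data: {risk_cc:.3f}")
```

On this toy single-sample dataset the scenario-aware estimate should track the fully supervised risk, while the case-control formula does not. Setting `c = 0` in `nn_risk_ss` recovers `nn_risk_cc`, which is one way to see that the two formulas differ only in the weights of the negative part.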

Paper Structure

This paper contains 6 sections, 3 theorems, 18 equations, 4 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1.1

Under SCAR, in the single-sample case, $P_{X \mid S=1} = P_{X \mid Y=1}$; that is, labeled observations follow the same distribution as positive observations.
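
A short derivation sketch of this equality, written in notation assumed here rather than taken from the paper: $\pi = P(Y=1)$, $c = P(S=1 \mid Y=1)$, and $p(\cdot)$ denotes a density of $X$. SCAR states that $P(S=1 \mid X=x, Y=1) = c$ for all $x$, and only positive observations can be labeled, so $S=1$ implies $Y=1$.

$$
\begin{aligned}
P(S=1 \mid X=x) &= P(S=1 \mid X=x, Y=1)\, P(Y=1 \mid X=x) = c\, P(Y=1 \mid X=x), \\
P(S=1) &= c\,\pi, \\
p(x \mid S=1) &= \frac{P(S=1 \mid X=x)\, p(x)}{P(S=1)} = \frac{c\, P(Y=1 \mid X=x)\, p(x)}{c\,\pi} = p(x \mid Y=1).
\end{aligned}
$$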

Figures (4)

  • Figure 1: Comparison of labeled and unlabeled class density for s-s and c-c data
  • Figure 2: Change of accuracy with label frequency increase for single-sample datasets
  • Figure 3: Test accuracy per epoch, selected single-sample datasets, $c = 0.9$
  • Figure 4: Risk components per epoch, Snacks dataset, $c = 0.9$. "Method" values are the risk values obtained during training, whereas "Correct" values are those that would be obtained in the given epoch if the scenario-aware risk were applied.

Theorems & Definitions (6)

  • Proposition 1.1
  • Proposition 1.2
  • Remark 1.3
  • Proposition 2.1
  • Remark 2.2
  • Remark 2.3