Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold

Praveen Kumar, Christophe G. Lambert

TL;DR

Two PU learning algorithms are proposed to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: PULSCAR (positive unlabeled learning selected completely at random), and PULSNAR (positive unlabeled learning selected not at random).

Abstract

Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the selected completely at random (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, $\alpha$, of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms vary; some estimate only the proportion, $\alpha$, of positives in the unlabeled set, while others calculate the probability that each specific unlabeled instance is positive, and some can do both. We propose two PU learning algorithms to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR employs a divide-and-conquer approach to cluster SNAR positives into subtypes and estimates $\alpha$ for each subtype by applying PULSCAR to positives from each cluster and all unlabeled. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.

Paper Structure

This paper contains 7 sections, 7 equations, 16 figures, 2 tables, and 5 algorithms.

Figures (16)

  • Figure 1: Visual intuition for the PULSCAR algorithm. PULSCAR finds the largest $\alpha$ such that $f_u(x) - \alpha f_p(x)$ is everywhere non-negative on $[0, 1]$. A) Kernel density estimates for simulated data with $\alpha=10\%$ positives in the unlabeled set; the estimated negative density (blue) nearly equals the ground truth (green). B) Overweighting the positive density with $\alpha=15\%$ causes the estimated negative density (blue), $f_u(x) - \alpha f_p(x)$, to drop below zero. C) Underweighting the positive density with $\alpha=5\%$ leaves the estimated negative density (blue) higher than the ground truth (green). D) Objective function, with the estimate $\alpha=10.68\%$ selected where the finite-differences estimate of the slope is largest; this is very close to the ground truth $\alpha=10\%$.
  • Figure 2: Schematic of the PULSNAR algorithm. An ML model is trained and tested with 5-fold cross-validation (CV) on all positive and unlabeled examples. The important covariates that the model used are scaled by their importance value. Positives are divided into $c$ clusters using the scaled important covariates. Then $c$ ML models are trained and tested with 5-fold CV, each on the records from one cluster and all unlabeled records. We estimate the proportions ($\alpha_1, \ldots, \alpha_c$) of each subtype of positives in the unlabeled examples using PULSCAR. The sum of those estimates gives the overall fraction of positives in the unlabeled set. P = positive set, U = unlabeled set.
  • Figure 3: KM1, KM2, TICE, DEDPUL, PULSCAR, and PULSNAR evaluated on SCAR and SNAR synthetic datasets. Each bar shows the mean estimated $\alpha$ with its 95% confidence interval. The black bars mark the true $\alpha$, so the best estimators are those closest to them. Bars larger than the black bars represent overestimation, while bars smaller than the black bars represent underestimation.
  • Figure 4: KM1, KM2, TICE, DEDPUL, and PULSCAR evaluated on SCAR ML benchmark datasets. Each bar shows the mean estimated $\alpha$ with its 95% confidence interval. KM1 and KM2 failed to execute on the KDD cup and Diabetes datasets. The black bars mark the true $\alpha$, so the best estimators are those closest to them. Bars larger than the black bars represent overestimation, while bars smaller than the black bars represent underestimation.
  • Figure 5: KM1, KM2, TICE, DEDPUL, PULSCAR, and PULSNAR evaluated on SNAR ML benchmark datasets. Each bar shows the mean estimated $\alpha$ with its 95% confidence interval. Because KM1 and KM2 took several hours to finish one iteration on the Shuttle dataset, the mean $\alpha$ was computed using 5 iterations, and the standard error was set to 0. The black bars mark the true $\alpha$, so the best estimators are those closest to them. Bars larger than the black bars represent overestimation, while bars smaller than the black bars represent underestimation.
  • ...and 11 more figures
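
The intuition in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: Figure 1D describes an objective function whose slope is examined with finite differences, whereas the code below swaps in a simpler rule, taking the largest $\alpha$ for which $f_u(x) - \alpha f_p(x)$ stays non-negative, i.e. the minimum of $f_u(x)/f_p(x)$ over a grid. The Beta-distributed scores, the sample sizes, the function name `estimate_alpha_upper_bound`, and the `min_rel_density` cutoff are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde


def estimate_alpha_upper_bound(pos_scores, unl_scores,
                               grid_size=201, min_rel_density=0.05):
    """Largest alpha keeping f_u(x) - alpha * f_p(x) >= 0 on a grid over [0, 1].

    Hypothetical helper: a simplified stand-in for the PULSCAR objective,
    not the method described in the paper.
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    f_p = gaussian_kde(pos_scores)(grid)  # density of labeled positives
    f_u = gaussian_kde(unl_scores)(grid)  # density of unlabeled examples
    # Ignore regions where the positive density is nearly zero: the ratio
    # there is unstable and does not constrain alpha.
    support = f_p >= min_rel_density * f_p.max()
    alpha_hat = float(np.min(f_u[support] / f_p[support]))
    return min(alpha_hat, 1.0), grid, f_p, f_u


# Simulated classifier scores (assumed distributions, for illustration only).
rng = np.random.default_rng(0)
true_alpha = 0.10
n_unlabeled = 20_000
n_hidden_pos = int(true_alpha * n_unlabeled)

pos_scores = rng.beta(8, 2, size=5_000)  # labeled positives score high
unl_scores = np.concatenate([
    rng.beta(8, 2, size=n_hidden_pos),                # hidden positives
    rng.beta(2, 8, size=n_unlabeled - n_hidden_pos),  # negatives score low
])

alpha_hat, grid, f_p, f_u = estimate_alpha_upper_bound(pos_scores, unl_scores)
print(f"true alpha = {true_alpha:.2f}, estimated alpha = {alpha_hat:.3f}")

# Mirrors panel B: overweighting the positives drives the estimated
# negative density below zero somewhere on the grid.
print("min of f_u - 0.15 * f_p:", float(np.min(f_u - 0.15 * f_p)))
```

Because the minimum of a noisy ratio is biased downward, this rule tends to underestimate $\alpha$ slightly on finite samples. That sensitivity is one reason a smoother criterion, such as the slope-based objective in Figure 1D, may be preferable in practice.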