PUAL: A Classifier on Trifurcate Positive-Unlabeled Data

Xiaoke Wang, Xiaochen Yang, Rui Zhu, Jing-Hao Xue

TL;DR

This work tackles PU learning on trifurcate data, where two positive subgroups lie on opposite sides of the negatives. It introduces PUAL, a classifier that applies an asymmetric loss to positive instances, and extends it to non-linear decision boundaries via kernelization and ADMM-based optimization. PUAL yields more faithful decision boundaries and higher F1-scores than GLLC and other PU approaches on synthetic trifurcate data and on 16 real-world UCI datasets, with the kernelized version handling complex separations. The results suggest PUAL’s practical value for PU tasks with multi-modal positive distributions; future work may use L1 regularization to improve sparsity and feature selection.

Abstract

Positive-unlabeled (PU) learning aims to train a classifier on data containing only labeled-positive instances and unlabeled instances. However, existing PU learning methods generally struggle to achieve satisfactory performance on trifurcate data, where the positive instances are distributed on both sides of the negative instances. To address this issue, we first propose a PU classifier with asymmetric loss (PUAL), which introduces an asymmetric loss structure on positive instances into the objective function of the global and local learning classifier (GLLC). We then develop a kernel-based algorithm that enables PUAL to obtain a non-linear decision boundary. Experiments on both simulated and real-world datasets show that PUAL can achieve satisfactory classification on trifurcate data.
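To make the idea of an asymmetric loss on positive instances concrete, the following minimal sketch contrasts a symmetric loss with a one-sided one. It is an illustration only: the squared loss and the hinge loss used here are assumptions chosen for clarity, and the exact loss functions of PUAL and GLLC are not stated in this summary. The function names are hypothetical.

```python
# Illustrative only: these two losses are stand-ins, not the paper's objective.

def squared_loss(margin):
    """Symmetric loss: penalizes any deviation of the margin y*f(x) from 1,
    including positives that are far on the correct side of the boundary."""
    return (1.0 - margin) ** 2


def one_sided_loss(margin):
    """Asymmetric (hinge) loss: penalizes only margins below 1, so a positive
    lying far on the correct side contributes nothing."""
    return max(0.0, 1.0 - margin)


# A margin of 50 mimics a positive instance far away from the boundary,
# as happens when one positive subgroup lies far from the negatives.
for margin in [-1.0, 0.5, 1.0, 5.0, 50.0]:
    print(margin, squared_loss(margin), one_sided_loss(margin))
```

Under these assumptions, a distant but correctly classified positive incurs a large symmetric penalty and can pull the boundary toward itself, while the one-sided loss ignores it.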

Paper Structure (30 sections, 1 theorem, 31 equations, 6 figures, 2 tables, 2 algorithms)

Key Result

Theorem 1

Let $\phi(\bm{X})$ and $\phi(\bm{Z})$ be mappings of the matrices $\bm{X}$ and $\bm{Z}$, and let $\bm{\kappa}_1(\phi(\bm{X}),\phi(\bm{Z}))$ be a kernel matrix of $\phi(\bm{X})$ and $\phi(\bm{Z})$. Then the following two matrices $\bm{\kappa}_2(\bm{X}, \bm{Z})$ and $\bm{\kappa}_3(\bm{X}, \ldots$

Figures (6)

  • Figure 1: The 2-dimensional t-SNE projection of the dataset wifi, where the positive set roughly consists of two subsets distributed on both sides of the negative instances; the t-SNE perplexity for wifi was set to 750.
  • Figure 2: A pattern of the linearly separable space constructed from the original trifurcate PU datasets via the kernel trick; $x_1$ and $x_2$ represent the mappings of the features.
  • Figure 3: The decision boundaries trained by (left) PUAL and (right) GLLC on the synthetic data with ${\mathbf{mean}_{p2}}=50, 100, 500$. Pink area: the negative region of PUAL; orange area: the negative region of GLLC; red points: positive instances; blue points: negative instances; the instances in the plots are from the test sets.
  • Figure 4: Boxplots of the difference between the F1-scores of PUAL and GLLC on each dataset, ranked in increasing order of median; label frequencies $\gamma=0.5$ (top) and $\gamma=0.25$ (bottom); x-axis: the datasets; y-axis: the difference between PUAL and GLLC in F1-score.
  • Figure 5: The t-SNE plots of the best four datasets wifi, OR1, OR2 and Pen; the t-SNE perplexity for these four datasets was set to 750, 40, 250 and 750, respectively; label frequency $\gamma=0.25$; red: positive instances; blue: negative instances; triangle: labeled instances; circle: unlabeled instances.
  • ...and 1 more figure
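
The trifurcate setting and the role of a feature mapping (Figures 2 and 3) can be sketched on toy data. The example below is a hypothetical 1-D illustration, not the paper's data generator: the cluster means, the spread, and the helper names are all assumptions. The label frequency `gamma` is taken to be the fraction of positives that are labeled, and an explicit map $x \mapsto x^2$ stands in for the kernel trick.

```python
import random


def make_trifurcate_pu(n_per_group=200, mean_p1=-5.0, mean_p2=5.0,
                       gamma=0.5, seed=0):
    """Toy 1-D data: positives around mean_p1 and mean_p2, negatives around 0.

    Returns features x, true classes y (hidden in real PU learning) and
    observed labels s, where each positive is labeled with probability gamma.
    """
    rng = random.Random(seed)
    pos = [rng.gauss(mean_p1, 0.5) for _ in range(n_per_group)]
    pos += [rng.gauss(mean_p2, 0.5) for _ in range(n_per_group)]
    neg = [rng.gauss(0.0, 0.5) for _ in range(n_per_group)]
    x = pos + neg
    y = [1] * len(pos) + [0] * len(neg)
    s = [1 if yi == 1 and rng.random() < gamma else 0 for yi in y]
    return x, y, s


def best_threshold_accuracy(feature, y):
    """Best accuracy of any single-threshold rule on a 1-D feature,
    allowing either side of the threshold to be the positive region."""
    best = 0.0
    for t in sorted(set(feature)):
        acc = sum((f >= t) == (yi == 1) for f, yi in zip(feature, y)) / len(y)
        best = max(best, acc, 1.0 - acc)
    return best


x, y, s = make_trifurcate_pu()
raw_acc = best_threshold_accuracy(x, y)                       # linear rule on x
mapped_acc = best_threshold_accuracy([v * v for v in x], y)   # linear rule on x^2
print("labeled positives:", sum(s), "of", sum(y))
print("raw feature:", raw_acc, "mapped feature:", mapped_acc)
```

On this toy data, a single threshold on the raw feature should reach only about two thirds accuracy, because one positive cluster always falls on the wrong side, whereas the mapped feature places both positive clusters on the same side of the negatives. The true classes `y` are used here only to inspect the geometry; a PU learner would see `x` and `s` alone.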

Theorems & Definitions (1)

  • Theorem 1