Learning with Positive and Imperfect Unlabeled Data

Jane H. Lee, Anay Mehrotra, Manolis Zampetakis

TL;DR

This work introduces Positive and Imperfect Unlabeled (PIU) learning, a generalization of PU learning in which the unlabeled data may be covariate-shifted or otherwise imperfect. It obtains sample- and computation-efficient algorithms by reducing PIU learning to constrained learning via Pessimistic-ERM rather than naive ERM, which enables robust learning under a generalized smoothness condition relating the true and observed data distributions. Key contributions include tight sample complexity for q-1 regimes, a computationally efficient PIU learner that leverages L1-polynomial approximations, and extensions to four applications: learning from positive samples under smooth distributions, list-decoding with a list of unlabeled distributions, truncated estimation with unknown survival sets, and truncation detection for non-product distributions. The results connect PIU learning to smoothed analysis and truncated statistics, with potential applications in bioinformatics, medicine, and data integration, and leave open questions on universal rates and class-specific efficiency.

Abstract

We study the problem of learning binary classifiers from positive and unlabeled data when the unlabeled data distribution is shifted, which we call Positive and Imperfect Unlabeled (PIU) Learning. In the absence of covariate shifts, i.e., with perfect unlabeled data, Denis (1998) reduced this problem to learning under Massart noise; however, that reduction fails under even slight shifts. Our main results on PIU learning are characterizations of the sample complexity of PIU learning and a computationally and sample-efficient algorithm achieving misclassification error $\varepsilon$. We further show that our results lead to new algorithms for several related problems.

1. Learning from smooth distributions: We give algorithms that learn interesting concept classes from only positive samples under smooth feature distributions, bypassing known impossibility results and contributing to recent advances in smoothened learning (Haghtalab et al., J.ACM'24) (Chandrasekaran et al., COLT'24).
2. Learning with a list of unlabeled distributions: We design new algorithms that apply to a broad class of concept classes under the assumption that we are given a list of unlabeled distributions, one of which, unknown to the learner, is $O(1)$-close to the true feature distribution.
3. Estimation in the presence of unknown truncation: We give the first polynomial sample and time algorithm for estimating the parameters of an exponential family distribution from samples truncated to an unknown set approximable by polynomials in $L_1$-norm. This improves the algorithm by Lee et al. (FOCS'24), which requires approximation in $L_2$-norm.
4. Detecting truncation: We present new algorithms for detecting whether given samples have been truncated (or not) for a broad class of non-product distributions, improving the algorithm by De et al. (STOC'24).

Paper Structure

This paper contains 100 sections, 38 theorems, 162 equations, 4 figures, 5 tables, 5 algorithms.

Key Result

Theorem 1.1

Suppose the smoothness assumption (asmp:smoothness) holds. Fix any $\varepsilon,\delta\in (0,1/2)$. There is an algorithm that, given $\varepsilon,\delta$ and $n=\widetilde{O}\left((\varepsilon \sigma)^{-2q}\cdot \left(\mathrm{VC}(\mathdutchcal{H})+\log(1/\delta)\right)\right)$ independent samples from $\euscr{P}^\star$ and ...
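To get a feel for how the bound in Theorem 1.1 scales, the sketch below evaluates only its leading term, $(\varepsilon\sigma)^{-2q}\cdot(\mathrm{VC}(\mathdutchcal{H})+\log(1/\delta))$. It ignores the constant and polylogarithmic factors hidden by $\widetilde{O}$, so the numbers are not actual sample sizes. The function name `leading_sample_term` and the example parameter values are hypothetical; $\sigma$ and $q$ are taken to be the parameters of the smoothness assumption, which this excerpt does not define.

```python
import math


def leading_sample_term(eps, delta, sigma, q, vc_dim):
    """Leading term (eps * sigma)^(-2q) * (VC + log(1/delta)).

    Illustrative only: the constant and polylog factors hidden by the
    O-tilde in Theorem 1.1 are NOT included.
    """
    if not (0 < eps < 0.5 and 0 < delta < 0.5):
        raise ValueError("Theorem 1.1 fixes eps, delta in (0, 1/2)")
    if sigma <= 0 or q <= 0 or vc_dim < 0:
        raise ValueError("sigma and q must be positive, vc_dim non-negative")
    return (eps * sigma) ** (-2 * q) * (vc_dim + math.log(1 / delta))


# Hypothetical parameter values, chosen only to show the scaling.
n = leading_sample_term(eps=0.1, delta=0.05, sigma=0.5, q=1, vc_dim=10)

# With q = 1, halving eps multiplies the leading term by 2^(2q) = 4.
n_half_eps = leading_sample_term(eps=0.05, delta=0.05, sigma=0.5, q=1, vc_dim=10)
ratio = n_half_eps / n
```

The dependence on $\delta$ is only logarithmic, while the dependence on $\varepsilon\sigma$ is polynomial with exponent $2q$, so under these assumptions the smoothness parameters dominate the cost.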

Figures (4)

  • Figure 1: Example of mixture distribution $\euscr{M}$ constructed by the distribution of positive examples $\euscr{P}^\star$ and the "imperfect" distribution of unlabeled examples $\euscr{D}$ (also see \ref{eq:overview:mixture}). Here, $\euscr{P}^\star$ is a truncation of a Gaussian distribution to a convex set (an interval, namely, ${H^\star}=[0,1]$) and $\euscr{D}$ is a Gaussian distribution. Nevertheless, $\euscr{M}$ is non-log-concave.
  • Figure 2: This figure illustrates one challenge in learning with PIU samples. Subfigure (a) presents the observed positive and (imperfect) unlabeled samples. Given these examples, it is impossible to distinguish between Scenarios (b) and (c). Hence, in particular, due to imperfections in the unlabeled samples, a hypothesis with low VC-dimension (e.g., a single halfspace in (b)) can look like a high-VC-dimension hypothesis (e.g., intersections of many halfspaces in (c)).
  • Figure 3: Outline of the proof of \ref{thm:main}. The proof of \ref{thm:main} follows a structure analogous to the proof of the sample complexity of PIU learning (\ref{thm:sampleComplexity}): we construct a sequence of instances of Pessimistic-ERM and output the intersection of all the (approximate) solutions $H_1,H_2,\dots$ to the instances created. To approximately solve each instance created, we use an approximation algorithm for Pessimistic-ERM (constructed in \ref{thm:constReg}). This is the main new technical ingredient in the proof of \ref{thm:main} compared to \ref{thm:sampleComplexity}.
  • Figure 4: Outline of the proofs of \ref{thm:sampleComplexity:bothSided} and \ref{new:thm:perm:robust}.
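The claim in the Figure 1 caption, that the mixture $\euscr{M}$ is non-log-concave even though both components are simple, can be checked numerically. The sketch below is an illustration under assumptions not stated in this excerpt: $\euscr{D}$ is taken to be a standard Gaussian, $\euscr{P}^\star$ is that Gaussian truncated to $[0,1]$, and the two are mixed with equal weights (the paper's own mixture is defined in an equation not reproduced here). All function names are hypothetical. A log-concave density $f$ must satisfy $f\left(\tfrac{x+z}{2}\right)^2 \ge f(x)\,f(z)$, so a single violating pair certifies non-log-concavity.

```python
import math


def std_normal_pdf(x):
    """Density of the standard Gaussian (assumed choice for D)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    """CDF of the standard Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def positive_pdf(x, lo=0.0, hi=1.0):
    """Density of the standard Gaussian truncated to [lo, hi] (P*)."""
    if x < lo or x > hi:
        return 0.0
    return std_normal_pdf(x) / (std_normal_cdf(hi) - std_normal_cdf(lo))


def mixture_pdf(x, weight=0.5):
    """Assumed mixture M = weight * P* + (1 - weight) * D."""
    return weight * positive_pdf(x) + (1.0 - weight) * std_normal_pdf(x)


def violates_log_concavity(pdf, x, z):
    """True if pdf(midpoint)^2 < pdf(x) * pdf(z), a midpoint violation."""
    mid = 0.5 * (x + z)
    return pdf(mid) ** 2 < pdf(x) * pdf(z)


# The pair straddles the left endpoint 0 of H* = [0, 1], where M jumps up.
mixture_violates = violates_log_concavity(mixture_pdf, -0.3, 0.1)
gaussian_violates = violates_log_concavity(std_normal_pdf, -0.3, 0.1)
```

The violation comes from the upward jump of the mixture density at the boundary of $H^\star$: just outside the interval only the Gaussian component contributes, while just inside both do. The plain Gaussian passes the same check at the same points.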

Theorems & Definitions (84)

  • Theorem 1.1: Sample Complexity
  • Theorem 1.3: Impossibility of Learning from Positive Samples
  • Definition 1: List Decoding PIU Model
  • Remark 1.6: Obtaining a List of Distributions From Adversarially Corrupted Data
  • Definition 2: Polynomial Threshold Functions (PTF)
  • Remark 3.1: Extensions to More General Notions of Smoothness
  • Definition 3: Polynomial Approximability
  • Theorem 3.1: Sample Complexity
  • Remark 3.2: Proper vs. Improper Learning
  • Theorem 3.3: Computationally Efficient Algorithm
  • ...and 74 more