Proper Learnability and the Role of Unlabeled Data
Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng
TL;DR
This work investigates when proper multiclass learning is possible. It shows that in a distribution-fixed PAC model, where the marginal distribution over unlabeled data is known, finite learning problems always admit an optimal proper learner governed by distributional regularization, and that sample complexity in this model improves on classical PAC by at most a logarithmic factor. It establishes a robust equivalence between distribution-fixed and standard PAC in terms of learnability (and a 2-approximation in expected error) and shows that proper learning can be witnessed by a distributional SRM via Bayesian arguments. The paper also proves strong impossibility results: proper learnability can be logically undecidable, is not a monotone property of the hypothesis class, and is not a local property. These results follow from a reduction of EMX learning to proper multiclass classification and suggest that simple dimension-based characterizations are unlikely. Together, the findings give a constructive route to proper learning when the marginal is known and identify fundamental barriers to any simple, universal characterization of proper learnability.
Abstract
Proper learning refers to the setting in which learners must emit predictors in the underlying hypothesis class $H$, and often leads to learners with simple algorithmic forms (e.g. empirical risk minimization (ERM), structural risk minimization (SRM)). The limitation of proper learning, however, is that there exist problems which can only be learned improperly, e.g. in multiclass classification. Thus, we ask: Under what assumptions on the hypothesis class or the information provided to the learner is a problem properly learnable? We first demonstrate that when the unlabeled data distribution is given, there always exists an optimal proper learner governed by distributional regularization, a randomized generalization of regularization. We refer to this setting as the distribution-fixed PAC model, and continue to evaluate the learner on its worst-case performance over all distributions. Our result holds for all metric loss functions and any finite learning problem (with no dependence on its size). Further, we demonstrate that sample complexities in the distribution-fixed PAC model can shrink by only a logarithmic factor from the classic PAC model, strongly refuting the role of unlabeled data in PAC learning (from a worst-case perspective). We complement this with impossibility results which obstruct any characterization of proper learnability in the realizable PAC model. First, we observe that there are problems whose proper learnability is logically undecidable, i.e., independent of the ZFC axioms. We then show that proper learnability is not a monotone property of the underlying hypothesis class, and that it is not a local property (in a precise sense). Our impossibility results all hold even for the fundamental setting of multiclass classification, and go through a reduction of EMX learning (Ben-David et al., 2019) to proper classification which may be of independent interest.
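To make the notion of a proper learner concrete, the following is a minimal sketch of empirical risk minimization (ERM) over a small finite hypothesis class under the 0-1 loss. It is an illustration only and not a construction from the paper: the names `empirical_risk`, `erm`, `HYPOTHESES`, and `SAMPLE`, as well as the toy three-label problem, are assumptions made for this example. The point it shows is that the learner's output is always an element of the class $H$.

```python
from typing import Callable, Hashable, Sequence, Tuple

# Hypothetical type aliases for this illustration.
Hypothesis = Callable[[Hashable], Hashable]
Sample = Sequence[Tuple[Hashable, Hashable]]


def empirical_risk(h: Hypothesis, sample: Sample) -> float:
    """Fraction of labeled examples that h misclassifies (0-1 loss)."""
    if len(sample) == 0:
        raise ValueError("sample must be non-empty")
    return sum(h(x) != y for x, y in sample) / len(sample)


def erm(hypotheses: Sequence[Hypothesis], sample: Sample) -> Hypothesis:
    """Return a hypothesis from the class with minimal empirical risk.

    The returned predictor is always a member of `hypotheses`,
    which is what makes this learner proper.
    """
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))


# Toy multiclass problem with labels {0, 1, 2} (an assumption for illustration):
# three constant predictors and one predictor that labels x by x mod 3.
HYPOTHESES = [lambda x: 0, lambda x: 1, lambda x: 2, lambda x: x % 3]
SAMPLE = [(0, 0), (1, 1), (2, 2), (3, 0)]

best = erm(HYPOTHESES, SAMPLE)
print(empirical_risk(best, SAMPLE))
```

An improper learner, by contrast, would be free to return any function from inputs to labels, for instance a majority vote over several members of the class, which need not itself belong to the class.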
