Revisiting Agnostic PAC Learning
Steve Hanneke, Kasper Green Larsen, Nikita Zhivotovskiy
TL;DR
This work analyzes agnostic PAC learning with unknown best-in-class error $\tau$ and shows that ERM, like every proper learner, is suboptimal by a factor of $\sqrt{\ln(1/\tau)}$. It introduces DisagreeingExperts, an improper learner that provably achieves $\mathrm{er}_{\mathcal{D}}(h_S) \le \tau + O\left( \sqrt{ \dfrac{\tau(d + \ln(1/\delta))}{n} } \right) + O\left( \dfrac{\ln^{5}(n/d)(d + \ln(1/\delta))}{n} \right)$ for nearly the full range of $\tau$. This is complemented by a lower bound showing that any proper learner must incur an $\Omega\left( \sqrt{ \tau d \ln(1/\tau)/n } \right)$ excess error. The core idea is a disagreement-based paradigm that recursively trains pairs of near-optimal classifiers and exploits the conditional distributions on the regions where they disagree, together with a refined analysis of ERM for near-identical hypotheses. The results point to improper methods as a path to near-optimal agnostic learning and leave open whether optimality extends to all $\tau$, whether the algorithm can adapt to the failure probability $\delta$, and whether it can be made computationally efficient.
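To get a feel for the size of the gap, the sketch below evaluates the leading terms of the two bounds numerically. It is illustrative only: the hidden constants in the $O(\cdot)$ and $\Omega(\cdot)$ notation are set to 1 (an assumption, since the summary does not give them), and the function names are hypothetical.

```python
import math

def improper_leading_term(tau, d, n, delta):
    """sqrt(tau * (d + ln(1/delta)) / n), with the hidden constant assumed to be 1."""
    return math.sqrt(tau * (d + math.log(1 / delta)) / n)

def improper_lower_order_term(d, n, delta):
    """ln^5(n/d) * (d + ln(1/delta)) / n, with the hidden constant assumed to be 1."""
    return math.log(n / d) ** 5 * (d + math.log(1 / delta)) / n

def proper_lower_bound_term(tau, d, n):
    """sqrt(tau * d * ln(1/tau) / n), with the hidden constant assumed to be 1."""
    return math.sqrt(tau * d * math.log(1 / tau) / n)

def proper_gap_factor(tau):
    """Ratio between the proper lower bound and sqrt(tau * d / n)."""
    return math.sqrt(math.log(1 / tau))

if __name__ == "__main__":
    d, n, delta = 10, 10**9, 0.01
    for tau in (1e-1, 1e-3, 1e-6):
        print(
            f"tau={tau:g}",
            f"improper={improper_leading_term(tau, d, n, delta):.3e}",
            f"proper_lb={proper_lower_bound_term(tau, d, n):.3e}",
            f"gap={proper_gap_factor(tau):.3f}",
        )
    print(f"lower-order term={improper_lower_order_term(d, n, delta):.3e}")
```

The gap factor grows only like $\sqrt{\ln(1/\tau)}$, so it matters most when $\tau$ is very small. With constants set to 1, the $\ln^{5}(n/d)$ term is only small when $n/d$ is large, which is why the sketch uses a large $n$.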
Abstract
PAC learning, dating back to Valiant '84 and Vapnik and Chervonenkis '64, '74, is a classic model for studying supervised learning. In the agnostic setting, we have access to a hypothesis set $\mathcal{H}$ and a training set of labeled samples $(x_1,y_1),\dots,(x_n,y_n) \in \mathcal{X} \times \{-1,1\}$ drawn i.i.d. from an unknown distribution $\mathcal{D}$. The goal is to produce a classifier $h : \mathcal{X} \to \{-1,1\}$ that is competitive with the hypothesis $h^\star_{\mathcal{D}} \in \mathcal{H}$ having the least probability of mispredicting the label $y$ of a new sample $(x,y)\sim \mathcal{D}$. Empirical Risk Minimization (ERM) is a natural learning algorithm that simply outputs the hypothesis from $\mathcal{H}$ making the fewest mistakes on the training data. This simple algorithm is known to have an optimal error in terms of the VC-dimension of $\mathcal{H}$ and the number of samples $n$. In this work, we revisit agnostic PAC learning and first show that ERM is in fact sub-optimal if we treat the performance of the best hypothesis, denoted $\tau := \Pr_{\mathcal{D}}[h^\star_{\mathcal{D}}(x) \neq y]$, as a parameter. Concretely, we show that ERM, and any other proper learning algorithm, is sub-optimal by a $\sqrt{\ln(1/\tau)}$ factor. We then complement this lower bound with the first learning algorithm achieving an optimal error for nearly the full range of $\tau$. Our algorithm introduces several new ideas that we hope may find further applications in learning theory.
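As a concrete illustration of the ERM rule described above, here is a minimal sketch over a small finite hypothesis class. The threshold classifiers, the noisy data generator, and all function names are assumptions made for the example; they are not taken from the paper, and the paper's algorithm DisagreeingExperts is not shown here.

```python
import random

def empirical_error(h, sample):
    """Fraction of labeled points (x, y) in the sample that h mispredicts."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypotheses, sample):
    """Return a hypothesis from the class with the fewest training mistakes."""
    return min(hypotheses, key=lambda h: empirical_error(h, sample))

def make_threshold(t):
    """Hypothetical 1-D classifier: predict +1 when x >= t, else -1."""
    return lambda x: 1 if x >= t else -1

def draw_sample(n, noise, rng):
    """Toy distribution: x uniform on [0, 1), label sign(x - 0.5) flipped w.p. noise."""
    sample = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x >= 0.5 else -1
        if rng.random() < noise:
            y = -y
        sample.append((x, y))
    return sample

if __name__ == "__main__":
    rng = random.Random(0)
    # Finite grid of thresholds standing in for the hypothesis set H.
    hypotheses = [make_threshold(k / 20) for k in range(21)]
    train = draw_sample(2000, noise=0.05, rng=rng)
    h_erm = erm(hypotheses, train)
    print("training error of ERM output:", empirical_error(h_erm, train))
```

In this toy distribution, with noise below $1/2$, the best threshold is $0.5$ and its error equals the noise rate, which plays the role of $\tau$. ERM is a proper learner because its output always belongs to the hypothesis class, so the $\sqrt{\ln(1/\tau)}$ lower bound from the abstract applies to it.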
