
Revisiting Agnostic PAC Learning

Steve Hanneke, Kasper Green Larsen, Nikita Zhivotovskiy

TL;DR

This work analyzes agnostic PAC learning when the best-in-class error $\tau$ is unknown and treated as a parameter, and shows that ERM, like every proper learner, is suboptimal by a factor of $\sqrt{\ln(1/\tau)}$. It introduces DisagreeingExperts, an improper learner that provably achieves $\mathrm{er}_{\mathcal{D}}(h_S) \le \tau + O\left( \sqrt{ \dfrac{\tau(d + \ln(1/\delta))}{n} } \right) + O\left( \dfrac{\ln^{5}(n/d)(d + \ln(1/\delta))}{n} \right)$ for almost the full range of $\tau$, complemented by a lower bound showing that proper learners must incur an excess error of $\Omega\left( \sqrt{ \tau d \ln(1/\tau)/n } \right)$. The core idea is a disagreement-based paradigm that recursively trains pairs of near-optimal classifiers and exploits the conditional distributions on the regions where they disagree, together with a refined ERM analysis for near-identical hypotheses. The results point to near-optimal agnostic learning via improper methods and leave open questions about extending optimality to all $\tau$, adaptivity to the failure probability $\delta$, and computational efficiency.
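To get a feel for the $\sqrt{\ln(1/\tau)}$ gap, the leading terms of the two bounds quoted above can be evaluated numerically. The sketch below is illustrative only: every constant hidden by $O(\cdot)$ and $\Omega(\cdot)$ is set to 1, which is an assumption, and the function names are hypothetical rather than taken from the paper.

```python
import math


def improper_sqrt_term(tau: float, d: int, n: int, delta: float) -> float:
    """sqrt(tau * (d + ln(1/delta)) / n), with the hidden constant set to 1 (assumption)."""
    return math.sqrt(tau * (d + math.log(1.0 / delta)) / n)


def improper_lower_order_term(d: int, n: int, delta: float) -> float:
    """ln^5(n/d) * (d + ln(1/delta)) / n, with the hidden constant set to 1 (assumption)."""
    return math.log(n / d) ** 5 * (d + math.log(1.0 / delta)) / n


def proper_lower_bound_term(tau: float, d: int, n: int) -> float:
    """sqrt(tau * d * ln(1/tau) / n), the excess error forced on proper learners (constant 1)."""
    return math.sqrt(tau * d * math.log(1.0 / tau) / n)


def proper_gap_factor(tau: float, d: int, n: int) -> float:
    """Ratio of the proper lower-bound term to sqrt(tau * d / n).

    The ln(1/delta) contribution is ignored here, so the ratio reduces
    to sqrt(ln(1/tau)) regardless of d and n.
    """
    return proper_lower_bound_term(tau, d, n) / math.sqrt(tau * d / n)


if __name__ == "__main__":
    for tau in (0.1, 0.01, 0.001):
        print(tau, proper_gap_factor(tau, d=10, n=100_000))
```

Because the constants are dropped, only the scaling in $\tau$, $d$, $n$ and $\delta$ is meaningful; the absolute values say nothing about the sample sizes at which either bound becomes non-trivial.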

Abstract

PAC learning, dating back to Valiant'84 and Vapnik and Chervonenkis'64,'74, is a classic model for studying supervised learning. In the agnostic setting, we have access to a hypothesis set $\mathcal{H}$ and a training set of labeled samples $(x_1,y_1),\dots,(x_n,y_n) \in \mathcal{X} \times \{-1,1\}$ drawn i.i.d. from an unknown distribution $\mathcal{D}$. The goal is to produce a classifier $h : \mathcal{X} \to \{-1,1\}$ that is competitive with the hypothesis $h^\star_{\mathcal{D}} \in \mathcal{H}$ having the least probability of mispredicting the label $y$ of a new sample $(x,y)\sim \mathcal{D}$. Empirical Risk Minimization (ERM) is a natural learning algorithm, where one simply outputs the hypothesis from $\mathcal{H}$ making the fewest mistakes on the training data. This simple algorithm is known to have an optimal error in terms of the VC-dimension of $\mathcal{H}$ and the number of samples $n$. In this work, we revisit agnostic PAC learning and first show that ERM is in fact sub-optimal if we treat the performance of the best hypothesis, denoted $\tau:=\Pr_{\mathcal{D}}[h^\star_{\mathcal{D}}(x) \neq y]$, as a parameter. Concretely we show that ERM, and any other proper learning algorithm, is sub-optimal by a $\sqrt{\ln(1/\tau)}$ factor. We then complement this lower bound with the first learning algorithm achieving an optimal error for nearly the full range of $\tau$. Our algorithm introduces several new ideas that we hope may find further applications in learning theory.
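The ERM rule described above can be sketched in a few lines. This is a minimal illustration, assuming a finite hypothesis set of one-dimensional threshold classifiers; the helper names (`make_threshold`, `empirical_error`, `erm`) and the toy sample are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], int]   # maps a point x to a label in {-1, 1}
Sample = Sequence[Tuple[float, int]]  # labeled pairs (x, y)


def make_threshold(t: float) -> Hypothesis:
    """Threshold classifier: predict 1 when x >= t, otherwise -1."""
    return lambda x: 1 if x >= t else -1


def empirical_error(h: Hypothesis, sample: Sample) -> float:
    """Fraction of training points that h mislabels."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def erm(hypotheses: List[Hypothesis], sample: Sample) -> Hypothesis:
    """Return a hypothesis from the given set with the fewest training mistakes."""
    return min(hypotheses, key=lambda h: empirical_error(h, sample))


# Toy agnostic sample: the point (0.7, -1) cannot be fit by any threshold
# that also labels 0.4, 0.6 and 0.9 as positive.
sample = [(0.1, -1), (0.2, -1), (0.4, 1), (0.6, 1), (0.7, -1), (0.9, 1)]
hypothesis_set = [make_threshold(t) for t in (0.0, 0.3, 0.5, 0.8)]
best = erm(hypothesis_set, sample)
```

Since `erm` always returns a member of `hypothesis_set`, it is a proper learner in the sense used above, so it falls within the class of algorithms covered by the paper's lower bound.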

Paper Structure

This paper contains 27 sections, 10 theorems, 54 equations, and 2 algorithms.

Key Result

Theorem 1

For any input domain $\mathcal{X}$, hypothesis set $\mathcal{H}$ of VC-dimension $d$, number of samples $n$, distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1,1\}$ and any $0 < \delta < 1$, it holds with probability at least $1-\delta$ over a sample $\mathbf{S} \sim \mathcal{D}^n$ that every … In particular, this implies that running ERM returns a hypothesis $h_\mathbf{S} \in \mathcal{H}$ sa…

Theorems & Definitions (18)

  • Theorem 1: ERM Theorem, derived from lls
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Proof of Lemma \ref{lem:goodfor}
  • Lemma 4
  • Proof of Lemma \ref{lem:progress}
  • Proof of Lemma \ref{lem:betteruni}
  • ...and 8 more