Revisiting Agnostic PAC Learning

Steve Hanneke; Kasper Green Larsen; Nikita Zhivotovskiy

TL;DR

This work analyzes agnostic PAC learning with unknown best-in-class error $\tau$ and demonstrates that ERM, like every proper learner, is suboptimal by a factor of $\sqrt{\ln(1/\tau)}$. It introduces DisagreeingExperts, an improper learner that provably achieves $\mathrm{er}_{\mathcal{D}}(h_S) \le \tau + O\left( \sqrt{ \dfrac{\tau(d + \ln(1/\delta))}{n} } \right) + O\left( \dfrac{\ln^{5}(n/d)(d + \ln(1/\delta))}{n} \right)$ for almost the full range of $\tau$, complemented by a lower bound showing that proper learners must incur an $\Omega\left( \sqrt{ \tau d \ln(1/\tau)/n } \right)$ slack. The core idea is a disagreement-based paradigm that recursively trains pairs of near-optimal classifiers and exploits the conditional distributions on which they disagree, together with a refined ERM analysis for near-identical hypotheses. The results point to a path toward near-optimal agnostic learning via improper methods, and leave open questions about extending optimality to all $\tau$, adaptivity to the failure probability $\delta$, and computational efficiency.
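
To get a feel for the size of the gap, the following sketch evaluates the excess-error terms from the bounds above numerically. It assumes every hidden constant equals 1, which the $O(\cdot)$ and $\Omega(\cdot)$ notation does not guarantee, so only the ratios and scaling are meaningful. The function names are illustrative and do not come from the paper.

```python
import math

def improper_leading_term(tau, d, n, delta):
    """sqrt(tau * (d + ln(1/delta)) / n), with the hidden constant assumed to be 1."""
    return math.sqrt(tau * (d + math.log(1 / delta)) / n)

def improper_lower_order_term(d, n, delta):
    """ln^5(n/d) * (d + ln(1/delta)) / n, with the hidden constant assumed to be 1."""
    return math.log(n / d) ** 5 * (d + math.log(1 / delta)) / n

def proper_lower_term(tau, d, n):
    """sqrt(tau * d * ln(1/tau) / n), the slack every proper learner must incur
    (hidden constant again assumed to be 1)."""
    return math.sqrt(tau * d * math.log(1 / tau) / n)

def gap_factor(tau, d, n):
    """Ratio of the proper-learner slack to sqrt(tau * d / n).

    Algebraically this is sqrt(ln(1/tau)), independent of d and n.
    """
    return proper_lower_term(tau, d, n) / math.sqrt(tau * d / n)
```

Calling `gap_factor(tau, d, n)` for decreasing `tau` shows the factor growing like $\sqrt{\ln(1/\tau)}$, while `improper_leading_term` carries no such logarithm.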

Abstract

PAC learning, dating back to Valiant'84 and Vapnik and Chervonenkis'64,'74, is a classic model for studying supervised learning. In the agnostic setting, we have access to a hypothesis set $\mathcal{H}$ and a training set of labeled samples $(x_1,y_1),\dots,(x_n,y_n) \in \mathcal{X} \times \{-1,1\}$ drawn i.i.d. from an unknown distribution $\mathcal{D}$. The goal is to produce a classifier $h : \mathcal{X} \to \{-1,1\}$ that is competitive with the hypothesis $h^\star_{\mathcal{D}} \in \mathcal{H}$ having the least probability of mispredicting the label $y$ of a new sample $(x,y)\sim \mathcal{D}$. Empirical Risk Minimization (ERM) is a natural learning algorithm, where one simply outputs the hypothesis from $\mathcal{H}$ making the fewest mistakes on the training data. This simple algorithm is known to have an optimal error in terms of the VC-dimension of $\mathcal{H}$ and the number of samples $n$. In this work, we revisit agnostic PAC learning and first show that ERM is in fact sub-optimal if we treat the performance of the best hypothesis, denoted $\tau := \Pr_{\mathcal{D}}[h^\star_{\mathcal{D}}(x) \neq y]$, as a parameter. Concretely, we show that ERM, and any other proper learning algorithm, is sub-optimal by a $\sqrt{\ln(1/\tau)}$ factor. We then complement this lower bound with the first learning algorithm achieving an optimal error for nearly the full range of $\tau$. Our algorithm introduces several new ideas that we hope may find further applications in learning theory.
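
As a concrete illustration of ERM as described above, here is a minimal sketch over a finite class of one-dimensional threshold classifiers. The threshold class, the label-flipping noise model, and all function names are assumptions chosen for illustration; they are not taken from the paper, and this is plain ERM, not the paper's improper learner.

```python
import random

def empirical_error(h, sample):
    """Fraction of labeled points (x, y) in the sample that h mispredicts."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypotheses, sample):
    """Return a hypothesis from the given class with the fewest training mistakes."""
    return min(hypotheses, key=lambda h: empirical_error(h, sample))

def threshold(t):
    """Classifier x -> +1 if x >= t, else -1 (a hypothetical example class)."""
    return lambda x: 1 if x >= t else -1

def make_sample(n, noise, rng):
    """Draw n points uniformly from [0, 1], label them by the threshold at 0.5,
    and flip each label independently with probability `noise`.
    For noise < 0.5, the best threshold then has error tau = noise."""
    sample = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x >= 0.5 else -1
        if rng.random() < noise:
            y = -y
        sample.append((x, y))
    return sample
```

For example, `erm([threshold(i / 20) for i in range(21)], make_sample(2000, 0.1, random.Random(0)))` returns a threshold classifier whose training error is no larger than that of `threshold(0.5)`, since ERM minimizes over a class containing it. Because the output always lies in the hypothesis class, ERM is a proper learner.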

Paper Structure (27 sections, 10 theorems, 54 equations, 2 algorithms)

Introduction
Realizable setting.
Agnostic setting.
Our Contributions.
Proof Overview
New algorithm.
Lower bound for proper learners.
Near-Optimal Agnostic PAC Learner
Simplifying assumptions.
Core algorithm
Brief overview.
Analysis.
Progress on termination (proof of Lemma \ref{lem:goodfor})
Failure events.
Termination in Step 11.
...and 12 more sections

Key Result

Theorem 1

For any input domain $\mathcal{X}$, hypothesis set $\mathcal{H}$ of VC-dimension $d$, number of samples $n$, distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1,1\}$ and any $0 < \delta < 1$, it holds with probability at least $1-\delta$ over a sample $\mathbf{S} \sim \mathcal{D}^n$ that every … In particular, this implies that running ERM returns a hypothesis $h_\mathbf{S} \in \mathcal{H}$ sa…

Theorems & Definitions (18)

Theorem 1: ERM Theorem, derived from lls
Theorem 2
Theorem 3
Lemma 1
Lemma 2
Lemma 3
Proof of Lemma \ref{lem:goodfor}
Lemma 4
Proof of Lemma \ref{lem:progress}
Proof of Lemma \ref{lem:betteruni}
...and 8 more
