On Agnostic PAC Learning in the Small Error Regime

Julian Asilis, Mikael Møller Høgsgaard, Grigoris Velegkas

TL;DR

The paper advances the understanding of agnostic PAC learning in the small-error regime. It works within the tau-aware error model of Hanneke, Larsen, and Zhivotovskiy (FOCS '24), where tau is the error of the best hypothesis in the class, and designs a computationally efficient learner whose error is at most 2.1 · tau plus additive terms of order sqrt(tau · (d + log(1/delta)) / m) + (d + log(1/delta)) / m. This matches the known lower bound when tau is close to d/m. The approach aggregates ERM classifiers through careful subsampling and a voting scheme, then refines the method with a 27-way sample split and a region-of-disagreement tie-breaker to reduce the multiplicative dependence on tau. A key contribution is showing how to combine this tau-based learner with the learner of Hanneke, Larsen, and Zhivotovskiy, using a hold-out set to select between the two, and thereby obtain a best-of-both-worlds guarantee. The results resolve the tau ≈ d/m regime and make progress on the questions of computationally efficient agnostic learners and of learners built from aggregations of ERM classifiers, while leaving open whether the constant can be driven to 1.
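
The idea of aggregating ERM classifiers by voting can be illustrated with a minimal sketch. It is not the paper's algorithm: the hypothesis class (one-dimensional thresholds), the uniform subsampling, and the names `erm_threshold`, `vote_of_erms`, `n_voters`, and `frac` are all assumptions chosen for illustration, and the paper's structured splitting and tie-breaking are omitted.

```python
import random

def erm_threshold(sample):
    """ERM over thresholds h_t(x) = +1 if x >= t else -1 (illustrative class)."""
    xs = sorted({x for x, _ in sample})
    candidates = [float("-inf")] + xs + [float("inf")]

    def mistakes(t):
        # Number of sample points misclassified by h_t.
        return sum((1 if x >= t else -1) != y for x, y in sample)

    return min(candidates, key=mistakes)

def vote_of_erms(sample, n_voters=9, frac=0.5, seed=0):
    """Train ERM on random subsamples and return their majority vote.

    Uniform subsampling is an assumption made for this sketch; the paper
    uses a more careful, structured choice of subsamples.
    """
    rng = random.Random(seed)
    k = max(1, int(frac * len(sample)))
    thresholds = [erm_threshold(rng.sample(sample, k)) for _ in range(n_voters)]

    def classifier(x):
        votes = sum(1 if x >= t else -1 for t in thresholds)
        return 1 if votes >= 0 else -1

    return classifier

# Toy non-realizable data: true threshold at 0.5 with two flipped labels.
data = [(i / 100, 1 if i >= 50 else -1) for i in range(100)]
data[10] = (0.10, 1)
data[80] = (0.80, -1)
h = vote_of_erms(data)
```

Each voter only calls an ERM routine, so the aggregate stays efficient whenever ERM itself is efficient for the class.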

Abstract

Binary classification in the classic PAC model exhibits a curious phenomenon: Empirical Risk Minimization (ERM) learners are suboptimal in the realizable case yet optimal in the agnostic case. Roughly speaking, this owes itself to the fact that non-realizable distributions $\mathcal{D}$ are simply more difficult to learn than realizable distributions -- even when one discounts a learner's error by $\mathrm{err}(h^*_{\mathcal{D}})$, the error of the best hypothesis in $\mathcal{H}$ for $\mathcal{D}$. Thus, optimal agnostic learners are permitted to incur excess error on (easier-to-learn) distributions $\mathcal{D}$ for which $\tau = \mathrm{err}(h^*_{\mathcal{D}})$ is small. Recent work of Hanneke, Larsen, and Zhivotovskiy (FOCS '24) addresses this shortcoming by including $\tau$ itself as a parameter in the agnostic error term. In this more fine-grained model, they demonstrate tightness of the error lower bound $\tau + \Omega\left(\sqrt{\frac{\tau(d + \log(1 / \delta))}{m}} + \frac{d + \log(1 / \delta)}{m} \right)$ in a regime where $\tau > d/m$, and leave open the question of whether there may be a higher lower bound when $\tau \approx d/m$, with $d$ denoting $\mathrm{VC}(\mathcal{H})$. In this work, we resolve this question by exhibiting a learner which achieves error $c \cdot \tau + O \left(\sqrt{\frac{\tau(d + \log(1 / \delta))}{m}} + \frac{d + \log(1 / \delta)}{m} \right)$ for a constant $c \leq 2.1$, thus matching the lower bound when $\tau \approx d/m$. Further, our learner is computationally efficient and is based upon careful aggregations of ERM classifiers, making progress on two other questions of Hanneke, Larsen, and Zhivotovskiy (FOCS '24). We leave open the interesting question of whether our approach can be refined to lower the constant from 2.1 to 1, which would completely settle the complexity of agnostic learning.
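
To see why a constant $c > 1$ in front of $\tau$ is harmless when $\tau \approx d/m$, note that the extra $(c - 1) \cdot \tau$ is then of the same order as the additive term. The sketch below evaluates both bounds numerically. The hidden constants inside $O(\cdot)$ and $\Omega(\cdot)$ are set to 1, which is purely an assumption for illustration, and the function names are hypothetical.

```python
import math

def additive_term(tau, d, m, delta):
    """sqrt(tau * (d + log(1/delta)) / m) + (d + log(1/delta)) / m.

    The hidden constant is taken to be 1 (an assumption for illustration).
    """
    rate = (d + math.log(1 / delta)) / m
    return math.sqrt(tau * rate) + rate

def upper_bound(tau, d, m, delta, c=2.1):
    """Shape of the upper bound: c * tau plus the additive term."""
    return c * tau + additive_term(tau, d, m, delta)

def lower_bound(tau, d, m, delta):
    """Shape of the lower bound: tau plus the additive term."""
    return tau + additive_term(tau, d, m, delta)

d, m, delta = 10, 10_000, 0.05

# Small-error regime tau = d/m: the gap (c - 1) * tau is dominated by
# the additive term, so the two bounds agree up to constant factors.
tau_small = d / m
gap_small = upper_bound(tau_small, d, m, delta) - lower_bound(tau_small, d, m, delta)

# Large tau: the gap (c - 1) * tau exceeds the additive term, so the
# constant in front of tau matters.
tau_large = 0.2
gap_large = upper_bound(tau_large, d, m, delta) - lower_bound(tau_large, d, m, delta)
```

In the second regime the multiplicative constant is what separates the two bounds, which is why lowering it from 2.1 to 1 remains the open question.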

Paper Structure

This paper contains 21 sections, 9 theorems, 155 equations, 2 figures, 4 algorithms.

Key Result

Theorem 1.4

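For any domain $\mathcal{X}$, hypothesis class $\mathcal{H}$ of VC dimension $d$, number of samples $m$, and parameter $\delta \in (0,1)$, there is an algorithm such that for any distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1,1\}$ it returns a classifier $h_S: \mathcal{X} \rightarrow \{-1, 1\}$ that, with probability at least $1 - \delta$ over the sample $S \sim \mathcal{D}^m$, has error $\mathrm{err}(h_S) \leq 2.1 \cdot \tau + O\left(\sqrt{\frac{\tau(d + \log(1 / \delta))}{m}} + \frac{d + \log(1 / \delta)}{m}\right)$, where $\tau$ is the error of the best hypothesis in $\mathcal{H}$.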

Figures (2)

  • Figure 1: The splitting process of algorithm $\mathcal{S}'$. The active set $S$ is split into three disjoint sets $S_1, S_2, S_3$. Each of these is then split into a new active set (green) and a set of previously recursed-on samples (grey), which are passed down to subsequent recursive calls.
  • Figure 2: A flowchart of the final algorithm. The initial sample $\mathbf{S}$ is split into three parts. $\mathbf{S}_1$ is used to construct our tie-breaking classifier $\tilde{\mathcal{A}}_1$. $\mathbf{S}_2$ is used to train the algorithm of Hanneke, Larsen, and Zhivotovskiy (FOCS '24), yielding $\tilde{\mathcal{A}}_2$. Finally, $\mathbf{S}_3$ is used as a hold-out set to select the better of the two classifiers.
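
The pipeline described in Figure 2 can be sketched as follows. This is a hedged illustration rather than the paper's procedure: the equal three-way split, the function names `empirical_error` and `best_of_both`, and the placeholder learners are assumptions, and the sample is assumed to be i.i.d. so that slicing by position gives independent parts.

```python
def empirical_error(h, sample):
    """Fraction of labeled points (x, y) in sample that h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def best_of_both(sample, learner_a, learner_b):
    """Train two learners on disjoint parts and pick one on a hold-out part.

    learner_a and learner_b map a list of (x, y) pairs to a classifier.
    Splitting into equal thirds is an assumption made for this sketch.
    """
    n = len(sample) // 3
    s1, s2, s3 = sample[:n], sample[n:2 * n], sample[2 * n:]
    h1 = learner_a(s1)  # stands in for the tie-breaking classifier
    h2 = learner_b(s2)  # stands in for the second learner from prior work
    # Keep whichever classifier makes fewer mistakes on the hold-out set.
    return min((h1, h2), key=lambda h: empirical_error(h, s3))

# Placeholder learners that ignore their input (illustration only).
always_plus = lambda s: (lambda x: 1)
always_minus = lambda s: (lambda x: -1)

toy = [(i, 1 if i % 5 else -1) for i in range(30)]  # 80% of labels are +1
chosen = best_of_both(toy, always_plus, always_minus)
```

Because the hold-out part is independent of both trained classifiers, its empirical errors concentrate around the true errors, which is what lets the selected classifier inherit the better of the two guarantees up to an additive estimation term.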

Theorems & Definitions (12)

  • Theorem 1.4
  • Remark 3.1
  • Lemma C.1
  • Theorem C.2
  • proof
  • Lemma C.3 (Shalev-Shwartz and Ben-David, Understanding Machine Learning: From Theory to Algorithms, Lemma B.10)
  • Theorem C.4
  • Lemma C.5
  • proof
  • Lemma C.6 (Shalev-Shwartz and Ben-David, Understanding Machine Learning: From Theory to Algorithms, Theorem 6.8)
  • ...and 2 more