Error Exponent in Agnostic PAC Learning

Adi Hendel, Meir Feder

TL;DR

This paper analyzes agnostic PAC learning for binary classification through the lens of error exponents, aiming to capture the exponential decay rate of the probability that the excess risk exceeds a threshold as the sample size grows. By imposing stability and related regularity assumptions, it derives an improved distribution-dependent bound in which the error probability decays as $\doteq e^{-n\min\{\frac{\delta}{4}, d\}}$ with $d = D_{KL}(\Pi \,\|\, Q)$, and shows that for small $\delta$ the agnostic exponent can coincide with the realizable exponent. The key insight is a decomposition that separates realizable learning error from agnostic misclassification, along with a KL-divergence based large deviation analysis (via the method of types and Sanov's theorem) over a structured partition of hypothesis regions into GLPs and their dominating sets. This yields a tighter, distribution-dependent guarantee than classical VC-based bounds and has practical implications for understanding knowledge distillation and other nonstandard settings where non-uniform rates prevail. Overall, the work advances theoretical understanding of when agnostic learning can be as fast as realizable learning and provides a framework for deriving sharper, problem-dependent PAC guarantees.
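
To make the shape of the stated exponent concrete, the following is a minimal sketch of how $\min\{\delta/4, d\}$ could be evaluated numerically. It assumes, purely for illustration, that $\Pi$ and $Q$ are given as finite discrete distributions; the function names (`kl_divergence`, `agnostic_exponent`, `predicted_error_probability`) are hypothetical and are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats between two discrete distributions.

    Assumption for illustration: p and q are equal-length sequences of
    probabilities that each sum to 1.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def agnostic_exponent(delta, d):
    """Exponent min(delta / 4, d) as quoted in the summary above."""
    return min(delta / 4.0, d)


def predicted_error_probability(n, delta, d):
    """First-order exponential approximation exp(-n * min(delta / 4, d)).

    This ignores sub-exponential factors, which the dot-equality notation
    also ignores.
    """
    return math.exp(-n * agnostic_exponent(delta, d))
```

In this sketch the exponent equals $\delta/4$ whenever $\delta/4 \le d$, and equals $d$ otherwise, so the minimum identifies which of the two terms limits the decay rate.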

Abstract

Statistical learning theory and the Probably Approximately Correct (PAC) criterion are the common approach to mathematical learning theory. PAC is widely used to analyze learning problems and algorithms, and has been studied thoroughly. Uniform worst-case bounds on the convergence rate have been well established using, e.g., VC theory or Rademacher complexity. However, in a typical scenario the performance could be much better. In this paper, we consider PAC learning using a somewhat different tradeoff, the error exponent, a well-established analysis method in Information Theory, which describes the exponential behavior of the probability that the risk will exceed a certain threshold as a function of the sample size. We focus on binary classification and find, under some stability assumptions, an improved distribution-dependent error exponent for a wide range of problems, establishing the exponential behavior of the PAC error probability in agnostic learning. Interestingly, under these assumptions, agnostic learning may have the same error exponent as realizable learning. The error exponent criterion can be applied to analyze knowledge distillation, a problem that so far lacks a theoretical analysis.
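
As a reading aid, the error exponent described above can be written in the usual large-deviations form. The notation here is an assumption and may differ from the paper's: $R(\cdot)$ denotes the risk, $\hat{f}_n$ the hypothesis learned from $n$ samples, $f_{opt}$ the best hypothesis in the class, and $\delta$ the threshold on the excess risk.

$$
P_e(n,\delta) = \Pr\left\{ R(\hat{f}_n) - R(f_{opt}) > \delta \right\},
\qquad
E(\delta) = \lim_{n \to \infty} -\frac{1}{n} \ln P_e(n,\delta),
$$

so that, whenever the limit exists, $P_e(n,\delta) \doteq e^{-n E(\delta)}$, where $\doteq$ denotes equality to first order in the exponent.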

Paper Structure (18 sections, 11 theorems, 85 equations, 2 figures)

This paper contains 18 sections, 11 theorems, 85 equations, 2 figures.

Key Result

Theorem 4.0.1

Given a hypothesis class $\{f_{\theta},\theta \in \Theta\}$ and a ground truth function $g$ with projection $f_{opt}$ on the hypothesis class, the following holds under the paper's assumptions (from the stability assumption through the assumption on $\delta$):

Figures (2)

  • Figure 1: Ground truth with stable optimal 2-dimensional linear hypothesis (see Definition 2, linear hypothesis class). $x_1$ and $x_2$ are uniformly distributed in $[0,1]$. The optimal hypothesis is achieved by $b_0 =-1.3,\ b_1=1,\ b_2=1$.
  • Figure 2: Empirical error exponent of the second term in the realizable+agnostic theorem. $\ell=2$, $\delta=0.1$. The empirical exponent (blue) was computed using simulation.
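
The idea of comparing an empirical exponent with a KL-divergence exponent, as in Figure 2, can be illustrated on a toy problem. The sketch below is not the paper's experiment. It assumes a simple stand-in event (the mean of $n$ Bernoulli($p$) draws reaching a level $a > p$) and uses the exact binomial tail in place of Monte Carlo simulation. All function names are hypothetical.

```python
import math


def binary_kl(a, p):
    """D(Bernoulli(a) || Bernoulli(p)) in nats, for 0 < p < 1 and 0 <= a <= 1."""
    total = 0.0
    if a > 0.0:
        total += a * math.log(a / p)
    if a < 1.0:
        total += (1.0 - a) * math.log((1.0 - a) / (1.0 - p))
    return total


def log_tail_probability(n, p, a):
    """Log of the exact probability that at least ceil(n * a) of n
    Bernoulli(p) draws equal 1, computed in the log domain to avoid underflow."""
    k_min = math.ceil(n * a - 1e-9)  # small slack against rounding in n * a
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
        for k in range(k_min, n + 1)
    ]
    peak = max(log_terms)
    # log-sum-exp: factor out the largest term before exponentiating
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def empirical_exponent(n, p, a):
    """Finite-n exponent -(1/n) * ln P(sample mean >= a)."""
    return -log_tail_probability(n, p, a) / n


if __name__ == "__main__":
    p, a = 0.3, 0.5  # toy parameters, chosen only for illustration
    print("KL exponent:", binary_kl(a, p))
    for n in (50, 200, 1000):
        print(n, empirical_exponent(n, p, a))
```

In this toy setting the finite-$n$ exponent lies above the KL exponent (by the Chernoff bound) and approaches it as $n$ grows, which is the kind of comparison a plot of an empirical exponent against a theoretical one conveys.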

Theorems & Definitions (27)

  • Definition 1: k-boundary hypothesis class
  • Definition 2: linear hypothesis class
  • Definition 3: Generalized Optimum Point
  • Definition 4: $A_{\theta}$ region
  • Definition 5: Dominating region
  • Definition 6: Stability
  • Theorem 4.0.1
  • Theorem 4.0.2
  • Lemma 1
  • Proof 1
  • ...and 17 more