Testing for Overfitting

James Schmidt

TL;DR

This work addresses the challenge of overfitting in supervised learning by proposing a hypothesis‑testing framework that quantifies generalization using both training and holdout data. It defines the mean overfitting margin $\mu_{\mathcal{H}}$ and the empirical margin $\varepsilon_{\mathcal{H}}(\mathsf{S},\mathsf{S}')$, enabling a statistical test based on concentration bounds when the cost is bounded in $[0,1]$. A key result shows that, given $k$ folds and a holdout size $m'$, the bound $\mathbb{P}_{\mathcal{Z}^{km+m'}}\left( \left| \frac{1}{k}\sum_{j=1}^k \varepsilon_{\mathcal{H}}(\mathsf{S}_j,\mathsf{S}') - \mu_{\mathcal{H}} \right| > \varepsilon \right) \le 4 e^{-k \varepsilon^2 / 2}$ holds under the condition $m' > k + \frac{2 \log(k/\delta)}{\varepsilon^2}$. The framework also discusses distributional shift, the relation to PAC learnability, and practical implications, supplemented by simulations and available code. Overall, the paper offers a concrete, data‑driven mechanism to diagnose overfitting and assess generalization without relying solely on uniform PAC guarantees.
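To get a feel for the numbers in the bound quoted above, the following minimal sketch evaluates its right-hand side, $4 e^{-k \varepsilon^2 / 2}$, and the smallest holdout size $m'$ satisfying $m' > k + \frac{2 \log(k/\delta)}{\varepsilon^2}$. The function names are hypothetical, and the sketch assumes that $\log$ denotes the natural logarithm; neither is stated in the summary.

```python
import math


def deviation_bound(k: int, eps: float) -> float:
    """Right-hand side 4 * exp(-k * eps^2 / 2) of the quoted bound.

    Hypothetical helper: only evaluates the formula, it proves nothing.
    """
    return 4.0 * math.exp(-k * eps ** 2 / 2.0)


def min_holdout_size(k: int, eps: float, delta: float) -> int:
    """Smallest integer m' with m' > k + 2 * log(k / delta) / eps^2.

    Assumes the natural logarithm (an assumption, not stated in the text).
    """
    threshold = k + 2.0 * math.log(k / delta) / eps ** 2
    # floor + 1 gives the smallest integer strictly above the threshold.
    return math.floor(threshold) + 1


if __name__ == "__main__":
    print(deviation_bound(k=200, eps=0.3))              # 4 * exp(-9)
    print(min_holdout_size(k=10, eps=0.5, delta=0.1))   # 47
```

Note that the right-hand side only becomes informative (below 1) once $k \varepsilon^2 > 2 \log 4$; for example, with $k = 10$ and $\varepsilon = 0.5$ it exceeds 1 and says nothing.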

Abstract

High-complexity models are notorious in machine learning for overfitting, a phenomenon in which models represent the data well but fail to generalize to the underlying data-generating process. A typical procedure for circumventing overfitting computes empirical risk on a holdout set and halts training (or raises a flag) once that risk begins to increase. Such practice often helps in outputting a well-generalizing model, but the justification for why it works is primarily heuristic. We discuss the overfitting problem and explain why standard asymptotic and concentration results do not hold for evaluation with training data. We then introduce and argue for a hypothesis test by means of which model performance may be evaluated using training data and overfitting may be quantitatively defined and detected. We rely on the same concentration bounds, which guarantee that empirical means should, with high probability, approximate their true mean, to conclude that they should approximate each other. We stipulate conditions under which this test is valid, describe how the test may be used for identifying overfitting, articulate a further nuance according to which distributional shift may be flagged, and highlight an alternative notion of learning that usefully captures generalization in the absence of uniform PAC guarantees.
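The reasoning in the abstract (empirical means that each approximate the true mean must approximate each other) can be illustrated with a small sketch. It is not the paper's procedure: it assumes per-example costs in $[0,1]$, takes as null hypothesis that the training risk lies within $\varepsilon/2$ of the true risk, and uses the standard two-sided Hoeffding inequality for the holdout mean. The names `hoeffding_tail` and `overfitting_test` are hypothetical.

```python
import math
from typing import Dict, Sequence, Union


def hoeffding_tail(n: int, t: float) -> float:
    """Two-sided Hoeffding bound on P(|mean - mu| > t) for n i.i.d. costs in [0, 1]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t ** 2))


def overfitting_test(
    train_costs: Sequence[float],
    holdout_costs: Sequence[float],
    eps: float,
) -> Dict[str, Union[float, bool]]:
    """Flag overfitting when training and holdout risks differ by more than eps.

    Assumed null hypothesis: the training risk is within eps/2 of the true
    risk. Under it, a gap above eps forces the holdout mean to miss the true
    risk by more than eps/2, which Hoeffding makes unlikely.
    """
    train_risk = sum(train_costs) / len(train_costs)
    holdout_risk = sum(holdout_costs) / len(holdout_costs)
    gap = abs(train_risk - holdout_risk)
    # Upper bound on the probability of a false flag under the null.
    false_flag_bound = hoeffding_tail(len(holdout_costs), eps / 2.0)
    return {
        "train_risk": train_risk,
        "holdout_risk": holdout_risk,
        "gap": gap,
        "flag": gap > eps,
        "false_flag_bound": false_flag_bound,
    }


if __name__ == "__main__":
    # Toy data: perfect fit on training, 50% error on holdout.
    result = overfitting_test([0.0] * 100, [1.0] * 50 + [0.0] * 50, eps=0.2)
    print(result)
```

In this toy run the gap is 0.5, which exceeds $\varepsilon = 0.2$, so the model is flagged; the false-flag bound is $2 e^{-m' \varepsilon^2 / 2}$ with $m' = 100$, which shrinks as the holdout set grows.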

Paper Structure

This paper contains 10 sections, 2 theorems, 25 equations, 4 figures.

Key Result

Proposition 3.1

Suppose that model $\hat{y}_{\mathsf{S}}$ $\varepsilon/2$-generalizes (eq:approxEquality). Then

Figures (4)

  • Figure 1: Slicing the Cost Function $c:\mathcal{H}\times(\mathcal{X}\times\mathcal{Y})^\omega\rightarrow\mathbb{R}$
  • Figure 2: Sequences of models $\hat{y}_{\mathsf{S_1}},\hat{y}_{\mathsf{S_2}},\ldots,\hat{y}_{\mathsf{S_m}},\ldots$
  • Figure 3: Geometry for map $e_{(\cdot)}(\cdot):(\mathcal{X}\times\mathcal{Y})^{2\omega}\rightarrow\mathbb{R}$
  • Figure 4: Model $o$ overfits.

Theorems & Definitions (5)

  • Definition 3.1
  • Definition 3.2
  • Proposition 3.1: Test for Overfitting
  • Proposition 3.2
  • Proof