Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification

Xiaohan Zhu, Nathan Srebro

TL;DR

This work analyzes a modified two-part-code MDL rule, MDL$_{\lambda}$, for supervised binary classification under agnostic learning and tracks the entire regularization path as $\lambda$ varies. The authors derive an exact worst-case limiting error function $\ell_{\lambda}(L^*)$, establish finite-sample PAC-Bayes-based upper bounds, and construct matching lower bounds to prove tightness. They show tempered overfitting for $\lambda\ge 1$ (and more nuanced behavior for $\lambda<1$), with consistency recovered when $\lambda$ grows as $\sqrt{m}$ or faster but potential catastrophic under- or over-regularization for other growth rates. The results provide a baseline for the cost of overfitting along MDL’s regularization path and offer guidance on selecting $\lambda$ in practice, with implications for understanding model complexity and generalization in discrete-prior MDL frameworks.

Abstract

We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. Grunwald and Langford [2004] previously established the lack of asymptotic consistency, from an agnostic PAC (frequentist worst-case) perspective, of the MDL rule with a penalty parameter of $\lambda=1$, suggesting that it under-regularizes. Driven by interest in understanding how benign or catastrophic under-regularization and overfitting might be, we obtain a precise quantitative description of the worst-case limiting error as a function of the regularization parameter $\lambda$ and noise level (or approximation error), significantly tightening the analysis of Grunwald and Langford for $\lambda=1$ and extending it to all other choices of $\lambda$.

Paper Structure

This paper contains 25 sections, 13 theorems, 83 equations, 4 figures.

Key Result

Theorem 3.1

(1) For any $0<\lambda\leq 1$, any source distribution $D$, any predictor $h^*$, any valid prior $\pi$, and any $m$: (2) For any $\lambda > 1$, any source distribution $D$, any predictor $h^*$, any valid prior $\pi$, and any $m$: Here $O(\cdot)$ hides only an absolute constant, which does not depend on $D$, $\pi$, or anything else.

Figures (4)

  • Figure 1: Pareto Frontier.
  • Figure 2: Agnostic worst-case limiting error $\ell_\lambda(L^*)$ (see \ref{cor:finite} and equation \ref{ell}) as a function of the noise level $L^*$, for different $\lambda$. For each noise level $L^*=L(h^*)$, the curve indicates the best possible guarantee on the limiting error. As $\lambda\rightarrow\infty$ the tempering curve approaches the diagonal $\ell(L^*)=L^*$, indicating consistency. For $\lambda<\infty$, the curve is strictly above the diagonal, i.e. $\ell(L^*)>L^*$ (for $0<L^*<0.5$), and we do not have consistency. For $\lambda \geq 1$, the curve is always below $0.5$ (the unshaded bottom half of the figure), indicating that for any noise level $L^*<0.5$ overfitting is "tempered" in that the limiting error is better than chance. But for $\lambda<1$, this is only the case for $L^*<L_\textrm{critical}=H^{-1}(\lambda)$, and this critical point is indicated by the blue dots on the curves for $\lambda=0.1,0.5$. For $\lambda=0$ the worst-case limiting error is always 1.
  • Figure 3: Agnostic worst-case limiting error $\ell_\lambda(L^*)$ of $\textnormal{MDL}_{\lambda}$ as a function of $\lambda$, at a fixed noise level $L^* = 0.1$. The error curve is a continuous function of $\lambda$ for $0\leq\lambda<\infty$.
  • Figure 4: Comparison to Grunwald and Langford [2004] (GL), for the case $\lambda=1$. Their lower bound for the limiting error of $\textnormal{MDL}_1$ is in green. Our matching lower and upper bounds are in red. Also shown in blue is their upper bound for the related Bayes predictor (they do not provide an upper bound for $\textnormal{MDL}_1$).

Theorems & Definitions (32)

  • Theorem 3.1: Agnostic Upper Bound
  • Theorem 3.2: Agnostic Lower Bound
  • Corollary 3.2.1
  • Theorem 3.3
  • Theorem 3.4
  • Corollary 3.4.1
  • Theorem 3.5
  • Lemma 5.1
  • proof
  • proof
  • ...and 22 more