Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification
Xiaohan Zhu, Nathan Srebro
TL;DR
This work analyzes a modified two-part-code MDL rule, MDL$_{\lambda}$, for supervised binary classification in the agnostic learning setting and tracks the entire regularization path as $\lambda$ varies. The authors derive an exact worst-case limiting error function $\ell_{\lambda}(L^*)$, establish finite-sample PAC-Bayes-based upper bounds, and construct matching lower bounds to prove tightness. They show tempered overfitting for $\lambda\ge 1$ (and more nuanced behavior for $\lambda<1$), with consistency recovered when $\lambda$ grows as $\sqrt{m}$ or faster, but potentially catastrophic under- or over-regularization for other growth rates. The results provide a baseline for the cost of overfitting along MDL’s regularization path and offer guidance on selecting $\lambda$ in practice, with implications for understanding model complexity and generalization in discrete-prior MDL frameworks.
Abstract
We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. Grünwald and Langford [2004] previously established the lack of asymptotic consistency, from an agnostic PAC (frequentist worst-case) perspective, of the MDL rule with a penalty parameter of $\lambda=1$, suggesting that it under-regularizes. Driven by interest in understanding how benign or catastrophic under-regularization and overfitting might be, we obtain a precise quantitative description of the worst-case limiting error as a function of the regularization parameter $\lambda$ and noise level (or approximation error), significantly tightening the analysis of Grünwald and Langford for $\lambda=1$ and extending it to all other choices of $\lambda$.
