PAC-Chernoff Bounds: Understanding Generalization in the Interpolation Regime
Andrés R. Masegosa, Luis A. Ortega
TL;DR
This work develops a distribution-dependent PAC-Chernoff bound that is perfectly tight for interpolators in over-parameterized regimes, linking generalization to a rate function from Large Deviation Theory. It introduces a notion of model smoothness defined through the rate function and shows that techniques such as $\ell_2$ regularization, distance-from-initialization regularization, input-gradient regularization, data augmentation, invariant architectures, and over-parameterization all steer the optimizer toward smoother interpolators, which generalize better. The framework unifies many regularization and architectural approaches under the inverse-rate regularizer, explains the double-descent phenomenon, and provides practical methods to estimate the rate function from data. By tying distributional information to generalization, the results offer a principled explanation of why modern interpolating learners, which are often massively over-parameterized, can generalize well, and of how to design learning strategies that favor smoother interpolators with lower test error.
Abstract
This paper introduces a distribution-dependent PAC-Chernoff bound that exhibits perfect tightness for interpolators, even within over-parameterized model classes. This bound, which relies on basic principles of Large Deviation Theory, defines a natural measure of the smoothness of a model, characterized by simple real-valued functions. Building upon this bound and the new concept of smoothness, we present a unified theoretical framework revealing why certain interpolators generalize exceptionally well while others falter. We theoretically show how a wide spectrum of modern learning methodologies, encompassing techniques such as $\ell_2$-norm, distance-from-initialization, and input-gradient regularization, in combination with data augmentation, invariant architectures, and over-parameterization, collectively guide the optimizer toward smoother interpolators, which, according to our theoretical framework, are the ones exhibiting superior generalization performance. This study shows that distribution-dependent bounds serve as a powerful tool to understand the complex dynamics behind the generalization capabilities of over-parameterized interpolators.
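
To make the notion of a rate function concrete, the following is a minimal sketch of how one might estimate it from a sample of per-example losses. It assumes the standard Cramér transform from Large Deviation Theory, that is, the supremum over $\lambda \ge 0$ of $\lambda a - J(\lambda)$, where $J(\lambda)$ is the log-moment generating function of the centered loss (mean loss minus per-example loss). The exact definition and estimator used in the paper are not given in this section, so the function names `cumulant` and `rate`, the grid over $\lambda$, and the centering convention are illustrative assumptions.

```python
import numpy as np


def cumulant(losses, lam):
    """Empirical log-moment generating function of (mean loss - loss) at lam.

    Assumption: the sample mean stands in for the expected loss.
    """
    z = lam * (losses.mean() - losses)
    m = z.max()  # shift for a numerically stable log-mean-exp
    return m + np.log(np.mean(np.exp(z - m)))


def rate(losses, a, lams=None):
    """Grid approximation of the Cramer transform sup_{lam >= 0} lam*a - J(lam)."""
    if lams is None:
        # Hypothetical search grid; lam = 0 is included so the result is >= 0.
        lams = np.linspace(0.0, 50.0, 2001)
    values = np.array([lam * a - cumulant(losses, lam) for lam in lams])
    return float(values.max())


# Toy 0-1 losses with error rate 0.5 (illustrative data, not from the paper).
losses = np.array([0.0, 1.0] * 500)
r = rate(losses, 0.25)
```

For this toy sample the value of `r` is approximately 0.1308, which matches the KL divergence between Bernoulli(0.75) and Bernoulli(0.5), as expected for the Cramér transform of a fair coin. Under this convention, a larger rate at a given gap `a` means that a deviation of size `a` between expected and empirical loss is exponentially less likely, which is the sense in which the rate function can rank models.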
