A direct proof of a unified law of robustness for Bregman divergence losses
Santanu Das, Jatin Batra, Piyush Srivastava
TL;DR
This paper generalizes the robustness-of-interpolation result of Bubeck and Sellke from square loss to the broader class of Bregman divergence losses, including vector-valued responses. By reframing the proof as a bias–variance decomposition centered on the conditional expectation ${\rm E}[Y|X]$, the authors derive a vector-valued law of robustness that avoids Rademacher contraction arguments and works under mixtures of isoperimetric covariate distributions. The main theorem shows that any Lipschitz-interpolating function with bounded parameterization must have a Lipschitz constant that grows with the sample size and dimensional parameters, enforcing a capacity-robustness trade-off in high-dimensional settings. The results specialize to squared loss and cross-entropy, with explicit bounds, and are extended to mixture models, highlighting broad applicability to practical losses (e.g., cross-entropy) beyond regression. Overall, the work provides a unified, elementary framework for understanding robustness of interpolating models under general loss functions and vector-valued responses, with potential implications for robust generalization in deep learning.
Abstract
In contemporary deep learning practice, models are often trained to near-zero loss, i.e., to nearly interpolate the training data. However, the number of parameters in the model is usually far larger than the number of data points n, the theoretical minimum needed for interpolation: a phenomenon referred to as overparameterization. In an interesting piece of work, Bubeck and Sellke considered a natural notion of interpolation: the model is said to interpolate when its training loss falls below the loss of the conditional expectation of the response given the covariate. For this notion of interpolation and for a broad class of covariate distributions (specifically those satisfying a natural notion of concentration of measure), they showed that overparameterization is necessary for robust interpolation, i.e., when the interpolating function is required to be Lipschitz. Their main proof technique applies to regression with square loss against a scalar response, but they remark that, via a connection to Rademacher complexity and using tools such as the Ledoux-Talagrand contraction inequality, their result can be extended to more general losses, at least in the case of scalar response variables. In this work, we recast the original proof technique of Bubeck and Sellke in terms of a bias-variance type decomposition, and show that this view directly unlocks a generalization to Bregman divergence losses (even for vector-valued responses), without the use of tools such as Rademacher complexity or the Ledoux-Talagrand contraction principle. Bregman divergences are a natural class of losses since, for these losses, the best estimator is the conditional expectation of the response given the covariate; the class also includes practical losses such as the cross-entropy loss. Our work thus gives a more general understanding of the main proof technique of Bubeck and Sellke and demonstrates its broad utility.
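To make concrete the claim that the conditional expectation is the best estimator under a Bregman divergence loss, the following is a minimal numerical sketch. It checks the standard identity E[D(Y, a)] = E[D(Y, mu)] + D(mu, a) with mu = E[Y], where D(y, a) = phi(y) - phi(a) - <grad phi(a), y - a>. This is a textbook bias-variance type identity and is not necessarily the exact decomposition used in the paper. All function names, the toy data, and the prediction vectors are illustrative assumptions.

```python
import math

def bregman(phi, grad_phi, y, a):
    """Bregman divergence D_phi(y, a) = phi(y) - phi(a) - <grad phi(a), y - a>."""
    g = grad_phi(a)
    return phi(y) - phi(a) - sum(gi * (yi - ai) for gi, yi, ai in zip(g, y, a))

# Generator phi(v) = ||v||^2, which yields the squared loss ||y - a||^2.
def sq_norm(v):
    return sum(x * x for x in v)

def sq_norm_grad(v):
    return [2.0 * x for x in v]

# Generator phi(v) = sum v_i log v_i (negative entropy, with 0 log 0 = 0).
# For a one-hot y and a strictly positive probability vector a,
# D_phi(y, a) equals the cross-entropy loss -log a_k.
def neg_entropy(v):
    return sum(x * math.log(x) for x in v if x > 0)

def neg_entropy_grad(v):
    return [math.log(x) + 1.0 for x in v]

def mean_vector(ys):
    """Coordinate-wise empirical mean of a list of vectors."""
    n = len(ys)
    return [sum(col) / n for col in zip(*ys)]

def decomposition(phi, grad_phi, ys, a):
    """Return (total, noise, bias) for the empirical distribution on ys,
    where total = mean D(y, a), noise = mean D(y, mu), bias = D(mu, a)."""
    mu = mean_vector(ys)
    total = sum(bregman(phi, grad_phi, y, a) for y in ys) / len(ys)
    noise = sum(bregman(phi, grad_phi, y, mu) for y in ys) / len(ys)
    bias = bregman(phi, grad_phi, mu, a)
    return total, noise, bias

# Toy one-hot labels over three classes (hypothetical data).
one_hot = [[1.0, 0.0, 0.0]] * 5 + [[0.0, 1.0, 0.0]] * 3 + [[0.0, 0.0, 1.0]] * 2
prediction = [0.2, 0.5, 0.3]  # hypothetical predicted class probabilities
ce_total, ce_noise, ce_bias = decomposition(
    neg_entropy, neg_entropy_grad, one_hot, prediction
)

# Toy vector-valued responses in R^2 for the squared loss (hypothetical data).
points = [[0.0, 1.0], [2.0, -1.0], [1.0, 3.0]]
sq_total, sq_noise, sq_bias = decomposition(
    sq_norm, sq_norm_grad, points, [0.5, 0.5]
)
```

Because the bias term D(mu, a) is nonnegative and vanishes at a = mu, the identity shows that the mean minimizes the expected loss. Applying it conditionally on the covariate gives the conditional expectation as the best estimator.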
