A direct proof of a unified law of robustness for Bregman divergence losses

Santanu Das; Jatin Batra; Piyush Srivastava

TL;DR

This paper generalizes the robustness-of-interpolation result of Bubeck and Sellke from square loss to the broader class of Bregman divergence losses, including vector-valued responses. By reframing the proof as a bias–variance decomposition centered on the conditional expectation $\mathbb{E}[Y|X]$, the authors derive a vector-valued law of robustness that avoids Rademacher contraction arguments and works under mixtures of isoperimetric covariate distributions. The main theorem shows that any Lipschitz-interpolating function with bounded parameterization must have a Lipschitz constant that grows with the sample size and dimensional parameters, enforcing a capacity-robustness trade-off in high-dimensional settings. The results specialize to squared loss and cross-entropy, with explicit bounds, and are extended to mixture models, highlighting broad applicability to practical losses (e.g., cross-entropy) beyond regression. Overall, the work provides a unified, elementary framework for understanding robustness of interpolating models under general loss functions and vector-valued responses, with potential implications for robust generalization in deep learning.

Abstract

In contemporary deep learning practice, models are often trained to near-zero loss, i.e., to nearly interpolate the training data. However, the number of parameters in the model is usually far larger than the number of data points $n$, the theoretical minimum needed for interpolation: a phenomenon referred to as overparameterization. Bubeck and Sellke considered a natural notion of interpolation: the model is said to interpolate when its training loss goes below the loss of the conditional expectation of the response given the covariate. For this notion of interpolation and for a broad class of covariate distributions (specifically, those satisfying a natural notion of concentration of measure), they showed that overparameterization is necessary for robust interpolation, i.e., interpolation in which the interpolating function is required to be Lipschitz. Their main proof technique applies to regression with square loss against a scalar response, but they remark that, via a connection to Rademacher complexity and using tools such as the Ledoux-Talagrand contraction inequality, their result can be extended to more general losses, at least in the case of scalar response variables. In this work, we recast the original proof technique of Bubeck and Sellke in terms of a bias-variance type decomposition, and show that this view directly unlocks a generalization to Bregman divergence losses (even for vector-valued responses), without the use of tools such as Rademacher complexity or the Ledoux-Talagrand contraction principle. Bregman divergences are a natural class of losses: for these, the best estimator is the conditional expectation of the response given the covariate, and the class includes other practical losses such as the cross-entropy loss. Our work thus gives a more general understanding of the main proof technique of Bubeck and Sellke and demonstrates its broad utility.
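To make the abstract's two claims about Bregman divergences concrete, the sketch below assumes the standard definition $D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle$ (the paper's own argument order may differ). It checks numerically that this recovers squared loss and KL divergence, and that the expected loss against any fixed prediction splits into a noise term around the mean plus a bias term. All function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Standard Bregman divergence: phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - float(np.dot(grad_phi(y), x - y))

# Generator 1: squared Euclidean norm, which yields squared loss.
def sq_norm(v):
    return float(np.dot(v, v))

def sq_norm_grad(v):
    return 2.0 * v

# Generator 2: negative entropy, which yields KL divergence on the simplex.
def neg_entropy(p):
    return float(np.sum(p * np.log(p)))

def neg_entropy_grad(p):
    return np.log(p) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
d_sq = bregman(sq_norm, sq_norm_grad, x, y)          # equals ||x - y||^2
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)  # equals KL(x || y)

# Bias-variance type decomposition for an empirical distribution of responses:
# mean_i D(Y_i, f) = mean_i D(Y_i, m) + D(m, f), where m is the sample mean.
rng = np.random.default_rng(0)
Y = rng.dirichlet(2.0 * np.ones(3), size=500)  # synthetic responses on the simplex
m = Y.mean(axis=0)                             # plays the role of E[Y | X]
f = np.array([0.5, 0.3, 0.2])                  # an arbitrary fixed prediction

lhs = np.mean([bregman(neg_entropy, neg_entropy_grad, yi, f) for yi in Y])
noise = np.mean([bregman(neg_entropy, neg_entropy_grad, yi, m) for yi in Y])
bias = bregman(neg_entropy, neg_entropy_grad, m, f)
rhs = noise + bias

print(d_sq, d_kl, lhs, rhs)
```

Because the bias term is nonnegative and vanishes only at `f = m`, the same identity shows why the mean is the best predictor under any such loss.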

Paper Structure (15 sections, 10 theorems, 71 equations)

Introduction
Our Contribution
Related work
Adversarial robustness experiments and overparameterization
Other theoretical works
Preliminaries
Bregman divergence losses
Realistic function classes
Concentration of measure
Overfitting
Proof of the main theorem
Specializing the result to specific losses
Extension to mixtures
Discussion
Proofs of corollaries of the main theorem

Key Result

Theorem 1.1

Let $\Omega$ be a compact convex subset of ${\mathbb{R}}^K$ for some $K > 0$ and let $\phi: \Omega \rightarrow {\mathbb{R}}$ be a continuously differentiable strictly convex function. Let $D_{\phi}$ denote the corresponding Bregman divergence loss. For $\Delta \subseteq {\mathbb{R}}^{d}$, let [...] Here, the hidden constant factors depend upon the properties of $\phi$ and the Lipschitz parameteri[...]

Theorems & Definitions (26)

Theorem 1.1: Main theorem (informal; see the formal main theorem)
Definition 2.1: Bregman divergence
Example 2.2
Theorem 2.3 (Banerjee et al., 2005)
Definition 2.4: Realistic function class
Example 2.5: Neural networks for regression.
Example 2.6: Neural networks for classification.
Definition 2.8: $c$-isoperimetry
Theorem 3.1
Lemma 3.2: Decomposition
...and 16 more
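
The statements of Definition 2.1 and Lemma 3.2 are not reproduced in this summary. As a hedged illustration of the kind of decomposition the abstract describes, the following is the standard derivation for Bregman divergences, assuming the usual definition with the gradient taken at the second argument; the paper's exact formulation and notation may differ.

```latex
% Standard Bregman divergence generated by a differentiable, strictly convex \phi
% (assumed convention: gradient taken at the second argument).
D_{\phi}(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle .

% Three-point identity, obtained by expanding all three divergences:
D_{\phi}(Y, f) - D_{\phi}(Y, m) - D_{\phi}(m, f)
  = \langle \nabla \phi(m) - \nabla \phi(f),\, Y - m \rangle .

% Take m = \mathbb{E}[Y \mid X] and f = f(X), then condition on X.
% The right-hand side has conditional mean zero, which gives
\mathbb{E}\bigl[ D_{\phi}(Y, f(X)) \bigr]
  = \mathbb{E}\bigl[ D_{\phi}(Y, \mathbb{E}[Y \mid X]) \bigr]
  + \mathbb{E}\bigl[ D_{\phi}(\mathbb{E}[Y \mid X], f(X)) \bigr] .
```

The first term on the right is the loss of the conditional expectation, the benchmark in Bubeck and Sellke's notion of interpolation, and the second is a nonnegative bias-type term. With $\phi(v) = \|v\|^2$ this reduces to the familiar bias-variance split for squared loss.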
