Some Fundamental Aspects about Lipschitz Continuity of Neural Networks

Grigory Khromov, Sidak Pal Singh

TL;DR

The paper investigates the inherent Lipschitz behaviour of neural networks beyond traditional global bounds by empirically bracketing the true Lipschitz constant between a data-grounded lower bound $C_{\mathcal{D}^+}$ and a simple upper bound $C_{\text{upper}}$. Across architectures from FCNs to ResNet-50 and Vision Transformers, on datasets including MNIST and ImageNet, the study shows that the local lower bound tracks the effective Lipschitz constant more faithfully than the upper bound, while training increases both bounds; a Lipschitz Double Descent is observed as width varies, and it interacts with label noise. The work highlights an implicit regularisation effect in over-parameterised networks, whose Lipschitz behaviour interacts nontrivially with label noise, and suggests that effective Lipschitz measures may be more informative for generalisation and robustness analyses than naive upper-bound estimates. Overall, the findings provide a scalable framework and empirical scaffolding to guide theoretical development of Lipschitz-based generalisation and robustness in large neural networks.

Abstract

Lipschitz continuity is a crucial functional property of any predictive model that naturally governs its robustness, generalisation, and adversarial vulnerability. Contrary to other works that focus on obtaining tighter bounds and developing different practical strategies to enforce certain Lipschitz properties, we aim to thoroughly examine and characterise the Lipschitz behaviour of Neural Networks. Thus, we carry out an empirical investigation in a range of different settings (namely, architectures, datasets, label noise, and more) by exhausting the limits of the simplest and the most general lower and upper bounds. As a highlight of this investigation, we showcase a remarkable fidelity of the lower Lipschitz bound, identify a striking Double Descent trend in both upper and lower bounds to the Lipschitz constant, and explain the intriguing effects of label noise on function smoothness and generalisation.

Paper Structure (76 sections, 1 theorem, 26 equations, 48 figures, 8 tables)

This paper contains 76 sections, 1 theorem, 26 equations, 48 figures, 8 tables.

Key Result

Proposition 2.1

Let a function $f: \mathbb{R}^d \mapsto \mathbb{R}^K$ be defined on some domain $\operatorname{dom}(f) \subseteq \mathbb{R}^d$. Let $f$ also be differentiable and $C$-Lipschitz continuous. Then the Lipschitz constant $C$ is given by $C = \sup_{\mathbf{x} \in \operatorname{dom}(f)} \| \nabla_{\mathbf{x}} f \|_{\tilde{\alpha}}$, where …
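To make Proposition 2.1 concrete, the following is a minimal NumPy sketch of how the supremum of the Jacobian norm can be bracketed in practice. It rests on several assumptions that this excerpt does not state: the norm is taken to be the spectral (2-)norm, $C_\text{lower}$ is read as the maximum Jacobian norm over a finite sample, $C_\text{avg\_norm}$ as the mean Jacobian norm, and $C_\text{upper}$ as the product of per-layer spectral norms. The two-layer bias-free ReLU network, its sizes, and the random "dataset" are hypothetical stand-ins, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network f(x) = W2 @ relu(W1 @ x); sizes are illustrative only.
d, h, K = 8, 32, 3
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
W2 = rng.normal(size=(K, h)) / np.sqrt(h)


def f(x):
    """Forward pass of the toy network."""
    return W2 @ np.maximum(W1 @ x, 0.0)


def jacobian(x):
    """Input-output Jacobian of f at x, shape (K, d).

    For a ReLU layer the derivative is a 0/1 mask on the hidden units,
    so the Jacobian is W2 @ diag(mask) @ W1.
    """
    mask = (W1 @ x > 0).astype(W1.dtype)
    return W2 @ (mask[:, None] * W1)


def spectral_norm(M):
    """Largest singular value of M (operator 2-norm, assumed here)."""
    return np.linalg.norm(M, ord=2)


# Stand-in for a dataset S: 200 random inputs.
X = rng.normal(size=(200, d))
jac_norms = np.array([spectral_norm(jacobian(x)) for x in X])

# Assumed readings of the quantities named in the figure captions.
c_lower = jac_norms.max()      # max Jacobian norm over the sample
c_avg_norm = jac_norms.mean()  # mean Jacobian norm over the sample
c_upper = spectral_norm(W2) * spectral_norm(W1)  # product of layer norms

print(f"C_avg_norm={c_avg_norm:.3f}  C_lower={c_lower:.3f}  C_upper={c_upper:.3f}")
```

Under these assumptions the ordering $C_\text{avg\_norm} \le C_\text{lower} \le C \le C_\text{upper}$ always holds: the sample maximum can only underestimate the supremum over the whole domain, while the product of layer norms ignores which ReLU units are active and so can only overestimate it.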

Figures (48)

  • Figure 1: Lipschitz constant bounds by training epoch for an FCN ReLU network with hidden layer widths $256$ (left) and $65{,}536$ (right) on MNIST1D. $C_\text{upper}$, $C_\text{lower}$ and $C_\text{avg\_norm}$ are computed on the training set $S$, whereas $C_{S^*}$ is the local Lipschitz constant computed on $S^*$. Relative to initialisation, the lower bound at convergence grows by factors of $63\times$ and $40\times$, and the upper bound by factors of $66\times$ and $10\times$, for widths $256$ and $65{,}536$ respectively. Results are averaged over 4 runs. See Appendix \ref{setup-convex-combinations}.
  • Figure 2: Lipschitz constant bounds for ResNet50 on a subset of $200{,}000$ samples of ImageNet. Results are averaged over 3 runs. More details in Appendix \ref{setup-evolution-resnet50}.
  • Figure 3: Evolution of the lower Lipschitz constant bounds for ViT on a $50{,}000$-sample ImageNet subset. More details in Appendix \ref{sec:vit-evolution}.
  • Figure 4: Distribution of the per-sample Jacobian norms for ResNet18, computed on the entire ImageNet and on $1{,}000{,}000$ hard convex combinations.
  • Figure 5: Function prediction (left) and local Lipschitz constant bounds (right) over the whole input domain $\mathcal{D} = [-5,5]^2$. More details in Appendix \ref{setup-visual-example}.
  • ...and 43 more figures

Theorems & Definitions (2)

  • Definition 2.1: Lipschitz continuous function
  • Proposition 2.1: Alternative definition (GeometricMeasureTheory; DBLP:journals/corr/abs-2004-08688)