Optimization-Induced Dynamics of Lipschitz Continuity in Neural Networks

Róisín Luo, James McDermott, Christian Gagné, Qiang Sun, Colm O'Riordan

TL;DR

This work develops a rigorous stochastic-differential-equation framework to model the temporal evolution of neural network Lipschitz continuity during SGD training. It decomposes the dynamics into layer- and network-level drift and diffusion terms driven by gradient flow projections, mini-batch gradient noise, and noise-curvature effects, with a detailed operator-norm perturbation analysis. A practical, low-rank gradient-noise estimator enables scalable computation of these dynamics on modern architectures, and the theory is validated on CIFAR-10/100 across multiple regularizers. The results reveal how initialization, batch size, label noise, and sampling trajectories shape Lipschitz growth, including near-convergence unbounded growth and noise-regularization effects, offering insights for robust, trustworthy deep learning systems.

Abstract

Lipschitz continuity characterizes the worst-case sensitivity of neural networks to small input perturbations; yet its dynamics (i.e., temporal evolution) during training remain under-explored. We present a rigorous mathematical framework to model the temporal evolution of Lipschitz continuity during training with stochastic gradient descent (SGD). This framework leverages a system of stochastic differential equations (SDEs) to capture both deterministic and stochastic forces. Our theoretical analysis identifies three principal factors driving the evolution: (i) the projection of gradient flows, induced by the optimization dynamics, onto the operator-norm Jacobian of parameter matrices; (ii) the projection of gradient noise, arising from the randomness in mini-batch sampling, onto the operator-norm Jacobian; and (iii) the projection of the gradient noise onto the operator-norm Hessian of parameter matrices. Furthermore, our theoretical framework sheds light on how factors such as noisy supervision, parameter initialization, batch size, and mini-batch sampling trajectories shape the evolution of the Lipschitz continuity of neural networks. Our experimental results demonstrate strong agreement between the theoretical implications and the observed behaviors.
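To make factor (i) concrete, the sketch below uses a standard first-order perturbation fact rather than the paper's own derivation: when the largest singular value of a matrix $W$ is simple, the gradient of the operator norm with respect to $W$ is the outer product of the leading left and right singular vectors, so a small gradient step changes the norm by roughly the inner product of that outer product with the step. The matrix `G`, the learning rate `lr`, and the function name are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

def operator_norm_jacobian(W):
    """Return (sigma_1, u_1 v_1^T) for a 2-D array W.

    Assumes the largest singular value is simple (not repeated), in which
    case u_1 v_1^T is the gradient of the spectral norm with respect to W.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return S[0], np.outer(U[:, 0], Vt[0])

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))   # stand-in for one layer's parameter matrix
G = rng.standard_normal((6, 4))   # stand-in for a loss gradient (hypothetical)
lr = 1e-5                         # illustrative step size

sigma, J = operator_norm_jacobian(W)

# First-order prediction: projection of the update -lr * G onto the Jacobian.
predicted = -lr * np.sum(J * G)

# Actual change of the operator norm after one gradient step.
sigma_new = np.linalg.svd(W - lr * G, compute_uv=False)[0]
actual = sigma_new - sigma
```

For a small step, `predicted` and `actual` should agree up to second-order terms; those second-order terms are where curvature (the operator-norm Hessian in factor (iii)) would enter.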

Paper Structure

This paper contains 28 sections, 14 theorems, 75 equations, 9 figures, 3 tables.

Key Result

Proposition 5

Starting from Equation equ:batch_gradient_noise, the batch gradient noise at time $t$ is estimated from two quantities: the point-wise gradient fluctuation $\boldsymbol{\Omega}_{t_i}^{(\ell)}$ and the batch-wise gradient fluctuation $\boldsymbol{\Omega}_t^{(\ell)}$.
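The displayed formulas of this proposition are not available in the text above, so the following is only a minimal sketch of the general idea under stated assumptions: it treats each fluctuation as a per-sample gradient minus the batch mean and uses the textbook unbiased sample covariance. The function name and the array layout (one flattened gradient per row) are hypothetical, and the paper's exact estimator may differ.

```python
import numpy as np

def gradient_noise_factor(per_sample_grads):
    """Return a factor F such that F @ F.T is the unbiased sample covariance.

    per_sample_grads: array of shape (B, d), one flattened gradient per row.
    Assumption: fluctuation = per-sample gradient minus the batch mean.
    """
    g = np.asarray(per_sample_grads, dtype=float)
    batch_size = g.shape[0]
    omega = g - g.mean(axis=0, keepdims=True)   # centered fluctuations, (B, d)
    return omega.T / np.sqrt(batch_size - 1)    # shape (d, B), rank <= B - 1

rng = np.random.default_rng(0)
grads = rng.standard_normal((4, 10))   # B = 4 samples, d = 10 parameters
F = gradient_noise_factor(grads)

# The full d x d covariance is formed here only to check the factor.
sigma_hat = F @ F.T
```

Keeping the factor `F` instead of the full covariance is what makes this kind of estimator low-rank: its storage grows with `B * d` rather than `d * d`, and its rank is at most `B - 1`.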

Figures (9)

  • Figure 1: Optimization-induced dynamics. During training, the network parameters, starting from $\boldsymbol{\theta}_0$, move towards a solution $\boldsymbol{\theta}_A$ or $\boldsymbol{\theta}_B$, driven by the optimization process, as shown in the loss landscape \ref{subfig:loss_landscape}. These optimization-driven dynamics induce the evolution of the network's Lipschitz continuity from $K_0$ to $K_A$ or $K_B$, as shown in the Lipschitz landscape \ref{subfig:lipschitz_landscape}. The trajectories in the loss landscape \ref{subfig:loss_landscape} and the Lipschitz landscape \ref{subfig:lipschitz_landscape} are visualized on the same parameter plane $\alpha O \beta$, where $\alpha$ and $\beta$ are two randomly chosen orthogonal directions in the parameter space.
  • Figure 2: Numerical validation of our mathematical framework. The theoretical Lipschitz constants computed using our framework closely agree with empirical observations. To validate our framework, we train a five-layer ConvNet on CIFAR-10 and CIFAR-100 across multiple configurations for $30,000$ steps ($200$ epochs). We collect the instance-wise gradients over time for all layers. Using Theorem \ref{theorem:layer_dynamics}, Theorem \ref{theorem:network_dynamics}, Theorem \ref{theorem:integral_form_network_dynamics} and Theorem \ref{theorem:statistics_of_lipschitz}, we compute the theoretically predicted Lipschitz continuity. The inset plots zoom in on the first $50$ steps and show that the Lipschitz constants do not necessarily grow monotonically. Results with more regularization configurations on CIFAR-10 are provided in Appendix \ref{sec:full_validation_cifar10}.
  • Figure 3: Dynamics near convergence. We profile both layer-specific and network-specific dynamics over $344,370$ steps ($1766$ epochs) on CIFAR-10. At the end of training, the final training loss and test loss are $9.75 \times 10^{-3}$ and $2.22$, respectively; the final training accuracy and test accuracy are $0.99996$ and $0.68540$, respectively. To investigate how the variances (i.e., diagonal elements) and covariances (i.e., off-diagonal elements) of the gradient noise affect the dynamics, the dynamics are computed with respect to variances ($\Sigma_{\mathrm{var}}$) and covariances ($\Sigma_{\mathrm{cov}}$) respectively, using Theorem \ref{theorem:network_dynamics}. The results indicate that: (i) the optimization-trajectory drift plays the primary role in shaping Lipschitz continuity over time; (ii) the covariances dominate the noise contributions; and (iii) the noise-curvature entropy production $\kappa_Z(t)$ remains significant near convergence, leading to a gradual and steady increase in Lipschitz continuity. The inset plot zooms in on $100$ steps at $t=330,000$. The moving averages are computed over a window of $500$ steps.
  • Figure 4: Predicted effect of the trajectory in mini-batch sampling on Lipschitz continuity. The inset plot zooms in on $50$ steps at $t=23000$.
  • Figure 5: Predicted effect of batch size on the variance of Lipschitz continuity.
  • ...and 4 more figures

Theorems & Definitions (20)

  • Definition 1: Globally $K$-Lipschitz Continuous [tao2006analysis, yosida2012functional]
  • Remark 2
  • Definition 4: Vectorized SDE for Continuous-Time SGD
  • Proposition 5: Unbiased Batch Gradient Noise Estimator
  • Proposition 6: Diagonal and Off-Diagonal Elements of Batch Gradient Noise
  • Proposition 7: Square Root Approximation of Covariance Matrix
  • Proposition 8: Lipschitz Continuity Bound in Feed-Forward Network
  • Proposition 9: Operator Norm of Linear Unit
  • Definition 10: Stochastic Dynamical System of Lipschitz Continuity Bound
  • Lemma 12: Operator-Norm Jacobian
  • ...and 10 more
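
The statements of Propositions 8 and 9 are not reproduced in this listing. As a hedged illustration of the kind of bound they concern, the sketch below uses the standard result that a feed-forward network with 1-Lipschitz activations (ReLU here) is Lipschitz with a constant no larger than the product of the layers' operator norms. The layer sizes, the bias-free network, and the function names are assumptions chosen for the example and are not taken from the paper.

```python
import numpy as np

def lipschitz_upper_bound(weights):
    """Product of spectral norms: an upper bound on the l2 Lipschitz constant
    of a bias-free feed-forward network with 1-Lipschitz activations."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

def mlp(x, weights):
    """Small ReLU network without biases; the last layer is linear."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x

rng = np.random.default_rng(0)
weights = [
    rng.standard_normal((16, 8)),
    rng.standard_normal((16, 16)),
    rng.standard_normal((4, 16)),
]
K = lipschitz_upper_bound(weights)

# Empirical sensitivity on random input pairs; it can never exceed K.
ratios = []
for _ in range(100):
    x, y = rng.standard_normal(8), rng.standard_normal(8)
    out_gap = np.linalg.norm(mlp(x, weights) - mlp(y, weights))
    ratios.append(out_gap / np.linalg.norm(x - y))
max_ratio = max(ratios)
```

The observed `max_ratio` is typically far below `K`, since the product bound is a worst case over all inputs; tracking how such a bound moves as the weights are updated is the quantity whose training dynamics the paper models.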