Optimization-Induced Dynamics of Lipschitz Continuity in Neural Networks
Róisín Luo, James McDermott, Christian Gagné, Qiang Sun, Colm O'Riordan
TL;DR
This work develops a rigorous stochastic-differential-equation framework to model the temporal evolution of neural network Lipschitz continuity during SGD training. It decomposes the dynamics into layer- and network-level drift and diffusion terms driven by gradient flow projections, mini-batch gradient noise, and noise-curvature effects, with a detailed operator-norm perturbation analysis. A practical, low-rank gradient-noise estimator enables scalable computation of these dynamics on modern architectures, and the theory is validated on CIFAR-10/100 across multiple regularizers. The results reveal how initialization, batch size, label noise, and sampling trajectories shape Lipschitz growth, including near-convergence unbounded growth and noise-regularization effects, offering insights for robust, trustworthy deep learning systems.
Abstract
Lipschitz continuity characterizes the worst-case sensitivity of neural networks to small input perturbations; yet its dynamics (i.e., temporal evolution) during training remain under-explored. We present a rigorous mathematical framework to model the temporal evolution of Lipschitz continuity during training with stochastic gradient descent (SGD). This framework leverages a system of stochastic differential equations (SDEs) to capture both deterministic and stochastic forces. Our theoretical analysis identifies three principal factors driving the evolution: (i) the projection of gradient flows, induced by the optimization dynamics, onto the operator-norm Jacobian of parameter matrices; (ii) the projection of gradient noise, arising from the randomness in mini-batch sampling, onto the operator-norm Jacobian; and (iii) the projection of the gradient noise onto the operator-norm Hessian of parameter matrices. Furthermore, our theoretical framework sheds light on how factors such as noisy supervision, parameter initialization, batch size, and mini-batch sampling trajectories shape the evolution of the Lipschitz continuity of neural networks. Our experimental results demonstrate strong agreement between the theoretical implications and the observed behaviors.
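As a rough illustration of factor (i), the sketch below projects a gradient step onto the operator-norm Jacobian of a single parameter matrix and compares the first-order prediction with the actual change in the spectral norm. It is a minimal sketch, not the paper's implementation: it assumes the operator-norm Jacobian is the standard derivative of the largest singular value, u1 v1^T, which holds when that singular value is simple. The matrix W, the stand-in gradient G, the learning rate lr, and the helper name spectral_norm_and_jacobian are all hypothetical choices made for this example.

```python
import numpy as np


def spectral_norm_and_jacobian(W):
    """Return the largest singular value of W and its derivative with respect to W.

    Assumes the top singular value is simple, so the derivative is u1 v1^T.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return S[0], np.outer(U[:, 0], Vt[0])


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))  # hypothetical parameter matrix
G = rng.standard_normal((8, 5))  # stand-in for a loss gradient (not a real one)
lr = 1e-5                        # hypothetical learning rate

sigma, J = spectral_norm_and_jacobian(W)

# First-order drift: Frobenius inner product of the update -lr * G with J.
predicted = -lr * np.sum(J * G)

# Actual change in the spectral norm after one gradient step.
actual = np.linalg.norm(W - lr * G, 2) - sigma

print(f"predicted change: {predicted:.3e}")
print(f"actual change:    {actual:.3e}")
```

For a small step the two numbers should agree up to second-order terms in lr. Those second-order terms involve the curvature of the operator norm, which is the kind of effect that factor (iii) attributes to the operator-norm Hessian.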
