Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints
Kyle Sung, Anastasis Kratsios, Noah Forman
TL;DR
The paper investigates whether the Lipschitz regularity of a two-layer MLP can be steered through the learning-rate (LR) schedule used in gradient descent (GD). It proves that an eventual LR decay, governed by a rate function $G$, induces a high-probability bound on the network's Lipschitz constant while preserving convergence of the empirical risk, measured with the Huber loss. It also yields generalization bounds with sub-linear dependence on the number of trainable parameters. These bounds are in line with standard GD guarantees and suggest that overparameterization (larger width, and hence parameter count) does not degrade statistical performance under the proposed LR-decay regime. Toy experiments corroborate the theory: a decaying LR yields smaller Lipschitz constants without sacrificing predictive power. Constant-LR training shows similar learning and regularity properties, which suggests that standard GD may already promote regularity in these networks.
Abstract
We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR, with a sub-linear dependence on their number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where, surprisingly, we observe that networks trained with constant-step-size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.
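The setting described in the abstract can be illustrated with a minimal NumPy sketch: full-batch GD on the mean-squared-error loss for a two-layer network, with an LR that is constant at first and decays after a fixed step. The specific schedule (`lr_schedule`, with `eta0`, `t_decay`, `power`), the `tanh` activation, the initialization, and the product-of-norms Lipschitz bound are all assumptions chosen for illustration; they are not the rate function $G$ or the bound derived in the paper.

```python
import numpy as np

def lr_schedule(t, eta0=0.01, t_decay=200, power=1.0):
    """Constant LR until t_decay, then polynomial decay.
    Hypothetical schedule; the paper's rate function G is not reproduced here."""
    if t < t_decay:
        return eta0
    return eta0 / (1.0 + (t - t_decay)) ** power

def lipschitz_upper_bound(W, a):
    """Upper bound on the Lipschitz constant (Euclidean norm) of
    x -> a^T tanh(W x + b). Since tanh is 1-Lipschitz, the constant is at most
    ||a||_2 times the spectral norm of W. This is a crude bound, not the paper's."""
    return np.linalg.norm(a) * np.linalg.norm(W, 2)

def train(X, y, width=32, steps=500, eta0=0.01, decay=True, seed=0):
    """Full-batch GD on the MSE loss for f(x) = a^T tanh(W x + b).
    Returns the final training MSE and the Lipschitz upper bound."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(width, d)) / np.sqrt(d)   # hidden-layer weights
    b = np.zeros(width)                            # hidden-layer biases
    a = rng.normal(size=width) / np.sqrt(width)    # output weights
    for t in range(steps):
        eta = lr_schedule(t, eta0) if decay else eta0
        H = np.tanh(X @ W.T + b)                   # hidden activations, (n, width)
        r = H @ a - y                              # residuals, (n,)
        g_out = 2.0 * r / n                        # d(MSE)/d(prediction)
        grad_a = H.T @ g_out
        dH = np.outer(g_out, a) * (1.0 - H ** 2)   # backprop through tanh
        grad_W = dH.T @ X
        grad_b = dH.sum(axis=0)
        a -= eta * grad_a
        W -= eta * grad_W
        b -= eta * grad_b
    mse = np.mean((np.tanh(X @ W.T + b) @ a - y) ** 2)
    return mse, lipschitz_upper_bound(W, a)
```

Calling `train(X, y, decay=True)` and `train(X, y, decay=False)` on the same toy regression data gives a rough way to compare the final MSE and the Lipschitz bound under the two regimes, in the spirit of the toy experiments mentioned above.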
