Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints

Kyle Sung, Anastasis Kratsios, Noah Forman

TL;DR

The paper investigates whether the Lipschitz regularity of a two-layer MLP can be steered by the gradient-descent learning-rate schedule. It proves that an eventual LR decay, governed by a rate function $G$, induces a high-probability bound on the network's Lipschitz constant and maintains convergence of the empirical risk under the Huber loss, while yielding generalization bounds with sub-linear dependence on the number of trainable parameters. The results indicate generalization behavior that is independent of width (parameter count) and consistent with standard GD guarantees, suggesting that overparameterization does not degrade statistical performance under the proposed LR-decay regime. Toy experiments corroborate the theory: a decaying LR yields smaller Lipschitz constants without sacrificing predictive power, and a constant LR can exhibit similar learning and regularity properties, implying that standard GD may inherently promote regularity in these networks.

Abstract

We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR with a sub-linear dependence on its number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where surprisingly, we observe that networks trained with constant step size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.
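
To make the notion of "a small Lipschitz constant" concrete, the following is a minimal sketch, not the authors' code. It assumes a two-layer network of the form $f(x) = B\,\tanh(Wx + b)$, where the bias $b$, the tanh activation, the dimensions, and the function names are illustrative assumptions. It uses the standard upper bound $\operatorname{Lip}(f) \le \|B\|_2 \, \|W\|_2$, which holds for any 1-Lipschitz activation, and checks it against difference quotients on random pairs of inputs.

```python
import numpy as np

def two_layer_mlp(x, W, b, B):
    """Evaluate f(x) = B @ tanh(W @ x + b) for a single input vector x."""
    return B @ np.tanh(W @ x + b)

def lipschitz_upper_bound(W, B, act_lip=1.0):
    """Upper bound Lip(f) <= ||B||_2 * act_lip * ||W||_2 (spectral norms)."""
    return np.linalg.norm(B, 2) * act_lip * np.linalg.norm(W, 2)

rng = np.random.default_rng(0)
d, p = 3, 50  # input dimension and hidden width (illustrative values only)
W = rng.normal(size=(p, d)) / np.sqrt(d)
b = rng.normal(size=p)
B = rng.normal(size=(1, p)) / np.sqrt(p)

bound = lipschitz_upper_bound(W, B)  # tanh is 1-Lipschitz, so act_lip = 1

# Empirical check: difference quotients on random pairs never exceed the bound.
xs = rng.normal(size=(200, d))
ys = rng.normal(size=(200, d))
ratios = [
    np.linalg.norm(two_layer_mlp(x, W, b, B) - two_layer_mlp(y, W, b, B))
    / np.linalg.norm(x - y)
    for x, y in zip(xs, ys)
]
max_ratio = max(ratios)
print(f"upper bound: {bound:.3f}, largest observed quotient: {max_ratio:.3f}")
```

The product of spectral norms is usually loose, so the observed quotients will typically sit well below it. It is used here only because it is cheap to compute and easy to track across GD iterations.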

Paper Structure

This paper contains 32 sections, 8 theorems, 65 equations, 7 figures.

Key Result

Theorem 3

In Setting setting, suppose the GD initialization satisfies Assumption ass:init. If the variable learning rate $\boldsymbol{\alpha}$ satisfies the GD LR Decay Conditions GradCons, then there is a constant $\kappa>0$, depending only on the distributions of $B_0$ and $W_0$, such that for every $\eta>0$ and every $T \in \mathbb{N}_+$, the Lipschitz bound of the theorem holds with probability at least $1-4 e^{-\eta^2}$.

Figures (7)

  • Figure 1: The Huber loss $(\ell)$: a $1$-Lipschitz surrogate of the mean squared error loss.
  • Figure 2: Learning Rate Decay: A decaying LR (blue curve) guarantees a desired Lipschitz regularity of the trained neural network (Theorem \ref{thrm:lr_lip}), while the standard constant step size (orange line) guarantees quadratic convergence to a stationary point of the ER. Following the constant orange line for a long enough time horizon $T>0$ and then tracing the blue curve's decay rate yields both quadratic convergence to a stationary point of the ER and the desired degree of Lipschitzness of the trained function (Theorem \ref{thrm:Convergence_and_stability}). In particular, the class of two-layer neural networks trained with GD along this two-part green "hybrid" curve generalizes at a rate independent of width (Corollary \ref{cor:GeneralizationImplication}).
  • Figure 3: Effect of Sample Size ($N$)
  • Figure 4: Effect of No. Parameters ($P$)
  • Figure 6: Target Function with Oscillatory Derivative $f(x)=\sin(x)$: Since our two-layer networks have their Lipschitz constant implicitly restricted by the GD, it is natural to check whether this inhibits their ability to learn a function with a complicated (e.g. oscillatory) derivative. Our experiments indicate that this is not the case. Here, $\beta=0.05$, $p=250$, $N=150$, and $\alpha_t = 0.01$ for $t=1,\ldots, 100$.
  • ...and 2 more figures
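
The captions of Figures 1 and 2 can be illustrated with a short sketch. This is a hedged illustration, not the paper's implementation. The Huber threshold `delta=1.0` is chosen so that the loss is 1-Lipschitz, and the paper's exact normalization may differ. The function `hybrid_lr`, its switch time `t_switch`, and its exponential decay `rate` are hypothetical stand-ins for the "constant, then decaying" curve; the paper's actual decay is governed by its rate function $G$, which is not reproduced here.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss of a residual r: quadratic near zero, linear in the tails.

    Its derivative is bounded by delta, so delta=1 gives a 1-Lipschitz loss.
    """
    r = np.asarray(r, dtype=float)
    quadratic = 0.5 * r**2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear)

def hybrid_lr(t, alpha0=0.01, t_switch=100, rate=0.05):
    """Hypothetical hybrid schedule: constant alpha0 up to t_switch,
    then exponential decay at the given rate (an assumed decay form)."""
    t = np.asarray(t, dtype=float)
    return alpha0 * np.exp(-rate * np.maximum(t - t_switch, 0.0))

# Small residuals are penalized quadratically, large ones linearly.
print(huber([0.5, 3.0]))

# Constant phase, switch point, then decaying phase.
print(hybrid_lr([1, 100, 120, 200]))
```

The default `alpha0=0.01` and `t_switch=100` echo the constant step size quoted in the Figure 6 caption. They are illustrative defaults only, not values prescribed by the theory.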

Theorems & Definitions (23)

  • Definition 1: Realization of a Two-Layer MLP
  • Definition 2: Rate Function
  • Example 1: Exponential and Polynomial Rate
  • Example 2: Hybrid-Exponential Rate
  • Theorem 3: Lipschitz Control Via Learning Rate Decay
  • Example 3: Lipschitz Bounds For Polynomial Learning Rate Decay
  • Example 4: Lipschitz Bounds For Hybrid-Exponential Learning Rate Decay
  • Corollary 4: Generalization Bounds for GD-Trained Networks with Linear Width Dependence
  • Example 5: Random Inputs on Spheres with Polynomial LR Decay
  • Theorem 5: Convergence at the Optimal GD-Rate
  • ...and 13 more