Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

Yuhang Cai; Jingfeng Wu; Song Mei; Michael Lindsey; Peter L. Bartlett

TL;DR

The paper analyzes large-stepsize gradient descent (GD) for non-homogeneous two-layer networks under the logistic loss and identifies a two-phase training dynamic: an initial edge-of-stability (EoS) phase in which the empirical risk oscillates, followed by a stable phase in which the risk decreases monotonically and margins grow. It proves that the stable phase begins once the empirical risk falls below a stepsize-dependent threshold, and that the normalized margin increases nearly monotonically in this phase, indicating an implicit bias toward margin maximization for non-homogeneous predictors. Under linear separability and an activation derivative bounded away from zero, the first phase ends in finite time. With a suitably large stepsize, the empirical risk then decays at an accelerated $\tilde{O}(1/t^2)$ rate, in contrast to the $\Omega(1/t)$ lower bound for GD that decreases the risk monotonically. The theory extends margin and optimization results beyond linear predictors and the mean-field/NTK regimes to networks of any width, and is corroborated by experiments on CIFAR-10 subsets and XOR data that show margin growth and faster convergence with large learning rates.

Abstract

The typical training of neural networks using large stepsize gradient descent (GD) under the logistic loss often involves two distinct phases, where the empirical risk oscillates in the first phase but decreases monotonically in the second phase. We investigate this phenomenon in two-layer networks that satisfy a near-homogeneity condition. We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize. Additionally, we show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors. If the dataset is linearly separable and the derivative of the activation function is bounded away from zero, we show that the average empirical risk decreases, implying that the first phase must stop in finite steps. Finally, we demonstrate that by choosing a suitably large stepsize, GD that undergoes this phase transition is more efficient than GD that monotonically decreases the risk. Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.

Paper Structure (25 sections, 6 theorems, 21 equations, 2 figures)

Introduction
Setup.
Observation.
Contributions.
Stable Phase and Margin Improvement
Margin for nearly homogeneous predictors.
Limitations.
Comparisons to existing works.
Edge of Stability Phase
Phase Transition and Fast Optimization
Fast optimization.
Effect of model rescaling.
Experiments
Margin improvement.
Fast optimization.
...and 10 more sections

Key Result

Theorem 2.2

Consider (GD) with stepsize $\tilde{\eta}$ on a predictor $f(\mathbf{w};\mathbf{x})$ that satisfies the bounded-gradient and smoothness assumptions, with gradient bound $\rho$ and smoothness constant $\beta$. If there exists $r \ge 0$ such that $L(\mathbf{w}_r) \le \frac{1}{\tilde{\eta}(2\rho^2+\beta)}$, then GD is in the stable phase for $t\ge r$, that is, $(L(\mathbf{w}_t))_{t\ge r}$ decreases monotonically. If additionally the predictor satisfies the near-homogeneity assumption and there exists $s \ge 0$ such that ...
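The risk-decrease lemma quoted in the proof text of this document gives $\frac{1}{\tilde{\eta}(2\rho^2+\beta)}$ as a level below which one GD step cannot increase the risk. As a minimal sketch, assuming that level is the relevant threshold, the helper below locates the first step of a recorded risk trace that falls below it. The function names and the risk values are hypothetical and only illustrate the arithmetic.

```python
def stable_threshold(eta, rho, beta):
    """Risk level 1 / (eta * (2 * rho^2 + beta)) from the risk-decrease lemma."""
    return 1.0 / (eta * (2.0 * rho ** 2 + beta))


def first_stable_step(risks, eta, rho, beta):
    """Return the first index r with risks[r] <= threshold, or None if there is none."""
    threshold = stable_threshold(eta, rho, beta)
    for step, risk in enumerate(risks):
        if risk <= threshold:
            return step
    return None


# Made-up risk trace: oscillation first, then a drop below the threshold.
example_risks = [0.9, 1.4, 0.7, 0.3, 0.2]
print(stable_threshold(eta=1.0, rho=1.0, beta=0.0))
print(first_stable_step(example_risks, eta=1.0, rho=1.0, beta=0.0))
```

Because the threshold scales like $1/\tilde{\eta}$, a larger stepsize lowers the risk level that must be reached before monotone decrease is guaranteed.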

Figures (2)

Figure 1: The behavior of (GD) for optimizing a non-homogeneous four-layer MLP with GELU activation function on a subset of the CIFAR-10 dataset. We randomly sample $6{,}000$ data points with labels "airplane" and "automobile" from CIFAR-10. The normalized margin is defined as $(\min_{i\in [n]} y_i f(\mathbf{w}_t;\mathbf{x}_i)) / \|\mathbf{w}_t\|^4$, which is close to the paper's definition of the normalized margin. The blue curves correspond to GD with a large stepsize $\tilde{\eta}=0.2$, where the empirical risk oscillates in the first phase but decreases monotonically in the second phase. The orange curves correspond to GD with a small stepsize $\tilde{\eta} =0.005$, where the empirical risk decreases monotonically. Furthermore, the margin panel suggests that the normalized margins of both curves increase and converge in the stable phases. Finally, the test-accuracy panel suggests that the large stepsize achieves a better test accuracy, consistent with larger-scale learning experiments (Hoffer et al., 2017; Goyal et al., 2017). More details can be found in the Experiments section.
Figure 2: Behavior of (GD) for two-layer networks (see Example 2.1) with the leaky softplus activation function (see Example 3.1, with $c=0.5$). We consider an XOR dataset and a subset of the CIFAR-10 dataset. In both cases, we observe that (1) GD with a large stepsize optimizes faster than GD with a small stepsize, (2) the asymptotic convergence rate of the empirical risk is $\mathcal{O}(1/ (\tilde{\eta} t))$, and (3) in the stable phase, the normalized margin increases (nearly) monotonically. These observations are consistent with our theoretical understanding of large stepsize GD. More details about the experiments are given in the Experiments section.
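The normalized margin in the figure captions is the smallest signed prediction $y_i f(\mathbf{w}_t;\mathbf{x}_i)$ divided by a power of the parameter norm. A minimal sketch of that computation follows. The exponent is left as a parameter (`degree`), since only the value 4 for the four-layer MLP of Figure 1 is stated here, and the scores, labels and weights are made-up numbers.

```python
import math


def normalized_margin(scores, labels, weights, degree):
    """Compute min_i y_i * f(w; x_i) / ||w||^degree.

    scores  -- predictor outputs f(w; x_i), one per example
    labels  -- labels y_i in {-1, +1}
    weights -- flat list of all parameters w
    degree  -- exponent of the norm (4 in the Figure 1 caption)
    """
    norm = math.sqrt(sum(w * w for w in weights))
    worst_signed_score = min(y * s for s, y in zip(scores, labels))
    return worst_signed_score / norm ** degree


# Hypothetical values, for illustration only.
scores = [2.0, -3.0, 1.0]
labels = [1, -1, 1]
weights = [3.0, 4.0]
print(normalized_margin(scores, labels, weights, degree=1))
```

A positive value means every example is classified correctly. Dividing by the norm power removes the growth that comes purely from scaling the parameters.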

Theorems & Definitions (10)

Example 2.1: Two-layer networks
Theorem 2.2: Stable phase and margin improvement
Example 3.1: Leaky activation functions
Theorem 3.2: The EoS phase for two-layer networks
Theorem 4.1: Phase transition and stable phase for two-layer networks
Corollary 4.2: Acceleration of large stepsize
Theorem 4.3: Lower bound in the classical regime
Definition 1: Linearization error
Lemma A.2: Self-boundedness of logistic loss
Proof of Lemma A.2 (self-boundedness of the logistic loss). See the proof of Proposition 5 in Wu et al. (2024). The lower bound is by the convexity of $\ell(\cdot)$.

The next lemma controls the decrease of the risk $L_t$.

Lemma (decrease of $L_t$). Suppose the bounded-gradient and smoothness assumptions hold. If $L(\mathbf{w}_t) \le \frac{1}{\tilde{\eta} \rho^2}$, then we have

$$-\tilde{\eta} (1+ \beta \tilde{\eta} L(\mathbf{w}_t)) \|\nabla L(\mathbf{w}_t)\|^2 \le L(\mathbf{w}_{t+1}) - L(\mathbf{w}_t) \le -\tilde{\eta} (1-(2\rho^2 + \beta )\tilde{\eta} L(\mathbf{w}_t)) \|\nabla L(\mathbf{w}_t)\|^2.$$

In particular, if $L(\mathbf{w}_t) \le \frac{1}{\tilde{\eta} (2\rho^2 + \beta)}$, then $L(\mathbf{w}_{t+1}) \le L(\mathbf{w}_t)$.

Proof. By the bounded-gradient and smoothness assumptions, we have $\| \nabla f\|_2 \le \rho$ and $f(\mathbf{w};\mathbf{x})$ is $\beta$-smooth as a function of $\mathbf{w}$. Therefore, for every $i\in[n]$ and some $\theta \in [0,1]$ we have

$$\begin{aligned}
\left|q_i(t+1)-q_i(t)\right| &= \left|y_i\left(f\left(\mathbf{w}_{t+1} ; \mathbf{x}_i\right)-f\left(\mathbf{w}_t ; \mathbf{x}_i\right)\right)\right| \\
&= \left|\nabla f\left(\mathbf{w}_t+\theta\left(\mathbf{w}_{t+1}-\mathbf{w}_t\right) ; \mathbf{x}_i\right)^{\top}\left(\mathbf{w}_{t+1}-\mathbf{w}_t\right)\right| && \text{by the mean value theorem} \\
&\le \rho \left\|\mathbf{w}_{t+1}-\mathbf{w}_t\right\| \\
&\le \rho\tilde{\eta} \|\nabla L_t\| && \text{since } \mathbf{w}_{t+1} = \mathbf{w}_t - \tilde{\eta} \nabla L_t \\
&\le \rho^2 \tilde{\eta} L_t \le 1 && \text{since } \|\nabla L_t \| \le L_t \rho.
\end{aligned}$$

Then by Lemma A.2, we have

$$\begin{aligned}
\ell(q_i(t+1)) &\le \ell(q_i(t)) + \ell^\prime(q_i(t)) (q_i(t+1) - q_i(t)) + 2\ell(q_i(t)) (q_i(t+1) - q_i(t))^2 \\
&\le \ell(q_i(t)) + \ell^\prime(q_i(t)) \langle y_i \nabla f(\mathbf{w}_t; \mathbf{x}_i), \mathbf{w}_{t+1} - \mathbf{w}_t \rangle + |\ell^\prime(q_i(t))| \cdot | \xi[f](\mathbf{w}_t, \mathbf{w}_{t+1})| + 2\ell(q_i(t)) (q_i(t+1) - q_i(t))^2 \\
&\le \ell(q_i(t)) + \ell^\prime(q_i(t)) \langle y_i \nabla f(\mathbf{w}_t; \mathbf{x}_i), \mathbf{w}_{t+1} - \mathbf{w}_t \rangle + \ell(q_i(t)) (\beta + 2\rho^2) \|\mathbf{w}_{t+1} - \mathbf{w}_t\|^2.
\end{aligned}$$

The second inequality holds since $q_{i}(t+1) - q_i(t) = \langle y_i \nabla f(\mathbf{w}_t; \mathbf{x}_i), \mathbf{w}_{t+1} - \mathbf{w}_t \rangle + y_i \xi[f] (\mathbf{w}_t, \mathbf{w}_{t+1})$, and the third follows from the lemma on the linearization error of a $\beta$-smooth function together with the previous bound on $|q_i(t+1)-q_i(t)|$. Taking an average over all data points, we have

$$L_{t+1} \le L_t -\tilde{\eta} \|\nabla L_t\|^2 + (2\rho^2 + \beta) \tilde{\eta}^2 L_t \|\nabla L_t\|^2,$$

which is equivalent to

$$L_{t+1} - L_t \le -\tilde{\eta} (1-(2\rho^2 + \beta) \tilde{\eta} L_t) \|\nabla L_t\|^2.$$

This proves the right-hand inequality. The left-hand inequality can be proved similarly. In detail, we can show that

$$\begin{aligned}
\ell(q_i(t+1)) &\ge \ell(q_i(t)) + \ell^{\prime}(q_i(t)) (q_i(t+1) - q_i(t)) \\
&\ge \ell(q_i(t)) + \ell^\prime (q_i(t)) \langle y_i \nabla f(\mathbf{w}_t;\mathbf{x}_i), \mathbf{w}_{t+1} - \mathbf{w}_t \rangle - |\ell^\prime (q_i(t))| \cdot | \xi[f] (\mathbf{w}_t, \mathbf{w}_{t+1})|.
\end{aligned}$$

Taking the average over all data points, we have

$$L_{t+1} \ge L_t -\tilde{\eta} (1+ \beta \tilde{\eta} L_t) \|\nabla L_t\|^2.$$

This completes the proof of the lemma on the decrease of $L_t$.

In this section, we demonstrate that the parameter norm, $\rho_t$, increases monotonically during the stable phase. We introduce a crucial quantity, $v_t$, defined as the inner product of the gradient and the negative weight vector:

$$v_t \coloneqq \langle \nabla L(\mathbf{w}_t), -\mathbf{w}_t \rangle.$$

This quantity plays a key role in controlling the increase of the parameter norm. Notably, $v_t$ appears as the cross term in the expansion of $\|\mathbf{w}_{t+1}\|^2 = \|\mathbf{w}_t - \tilde{\eta} \nabla L(\mathbf{w}_t)\|^2$. By controlling $v_t$, we can characterize the increase in the parameter norm.
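The two-sided bound on $L_{t+1} - L_t$ and the cross-term role of $v_t$ can be checked numerically. The sketch below is a sanity check under assumptions that are not taken from the paper: it uses a linear predictor $f(\mathbf{w};\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle$ on a hypothetical three-point dataset with $\|\mathbf{x}_i\| \le 1$, for which $\rho = 1$ and $\beta = 0$, rather than a two-layer network. It also verifies the expansion $\|\mathbf{w}_{t+1}\|^2 = \|\mathbf{w}_t\|^2 + 2\tilde{\eta} v_t + \tilde{\eta}^2 \|\nabla L(\mathbf{w}_t)\|^2$.

```python
import math

# Hypothetical toy data with ||x_i|| <= 1. For the linear predictor
# f(w; x) = <w, x> this gives gradient bound RHO = 1 and smoothness BETA = 0.
DATA = [((0.6, 0.8), 1), ((-0.8, 0.6), -1), ((1.0, 0.0), 1)]
RHO, BETA = 1.0, 0.0


def risk_and_grad(w):
    """Empirical logistic risk L(w) and its gradient for the linear predictor."""
    n = len(DATA)
    risk, grad = 0.0, [0.0] * len(w)
    for x, y in DATA:
        q = y * sum(wi * xi for wi, xi in zip(w, x))  # q_i = y_i f(w; x_i)
        risk += math.log1p(math.exp(-q)) / n          # l(q) = log(1 + e^{-q})
        dl = -1.0 / (1.0 + math.exp(q))               # l'(q)
        for j, xj in enumerate(x):
            grad[j] += dl * y * xj / n
    return risk, grad


def run_gd(eta, steps):
    """Run GD from w = 0 and record the quantities that appear in the lemma."""
    w, records = [0.0, 0.0], []
    for _ in range(steps):
        risk, grad = risk_and_grad(w)
        g2 = sum(g * g for g in grad)
        v = -sum(g * wi for g, wi in zip(grad, w))    # v_t = <grad L(w_t), -w_t>
        w_next = [wi - eta * g for wi, g in zip(w, grad)]
        next_risk, _ = risk_and_grad(w_next)
        records.append({
            "premise": risk <= 1.0 / (eta * RHO ** 2),  # lemma requires this
            "diff": next_risk - risk,
            "lower": -eta * (1 + BETA * eta * risk) * g2,
            "upper": -eta * (1 - (2 * RHO ** 2 + BETA) * eta * risk) * g2,
            "norm2_next": sum(wi * wi for wi in w_next),
            "norm2_pred": sum(wi * wi for wi in w) + 2 * eta * v + eta ** 2 * g2,
        })
        w = w_next
    return records


records = run_gd(eta=1.0, steps=50)
inside = sum(1 for r in records
             if r["premise"] and r["lower"] - 1e-12 <= r["diff"] <= r["upper"] + 1e-12)
print(inside, "of", len(records), "steps satisfy the two-sided bound")
```

The lower bound is simply convexity of the risk. The upper bound is where the stepsize matters, since its right-hand side is negative only when $L_t < \frac{1}{\tilde{\eta}(2\rho^2+\beta)}$.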
Recall that our loss function is $\ell(x) := \log(1+e^{-x})$. Inspired by Lyu and Li (2020), we define the following two auxiliary functions for the logistic loss:

$$\psi(x)\coloneqq -\log (\ell(x)) = -\log\log(1+e^{-x}), \quad x\in\mathbb{R},$$

$$\iota(x)\coloneqq \psi^{-1}(x) = -\log (e^{e^{-x}}-1),\quad x\in\mathbb{R}.$$

One important remark is that if we change the loss to the exponential loss, both $\psi$ and $\iota$ become the identity function. Since the logistic loss and the exponential loss have similar tails, our $\psi(x)$ and $\iota(x)$ are close to the identity function, i.e., $\psi(x) \approx \iota(x) \approx x$ for $x$ large enough. Then, we have an exponential-loss-like decomposition of $L_t$:

$$L_t = \frac{1}{n} \sum_{i=1}^n \ell(q_i(t)) = \frac{1}{n} \sum_{i=1}^n e^{-\psi(q_i(t))}.$$

These two functions $\psi, \iota$ will help us analyze the lower bound of $v_t$. First, we list some properties of $\psi$ and $\iota$.

Lemma (auxiliary functions of $\ell$). The following claims hold for $\ell$, $\psi$, and $\iota$: (1) $\ell(x) = e^{-\psi(x)}$; (2) $\ell$ is monotonically decreasing, while $\psi$ and $\iota$ are monotonically increasing; (3) $\psi^\prime ( \iota(x)) = \frac{1}{\iota^\prime (x)}$; (4) $\psi^\prime (x) x$ is increasing for $x\in (0, +\infty)$.

Proof. The first two properties are straightforward. For the third property, we apply the chain rule to $\psi(\iota(x)) = x$ to get $\psi^\prime (\iota(x)) \iota^\prime (x) = 1$. For the fourth property, notice that

$$\psi^\prime (x) x = \frac{x}{(1+e^x)\log(1+e^{-x})}.$$

The denominator is positive and decreasing since

$$\frac{d}{dx} \left[(1+e^x)\log(1+e^{-x})\right] = e^x \log(1+e^{-x}) - 1 \le e^x e^{-x} -1 =0.$$

Combining this with the fact that $x$ is positive and increasing completes the proof of the lemma on the auxiliary functions of $\ell$.

In addition, we have the following property of $\iota$. This is the key lemma for handling the homogeneous error. In fact, this lemma is another way to show that $\iota(x)$ is close to the identity function.

Lemma. For every $x\in\mathbb{R}$, we have $\frac{\iota (x)}{\iota^\prime (x)} \ge x+\log \log 2.$
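The stated properties of $\psi$ and $\iota$ can be spot-checked numerically. The sketch below is only a sanity check on a finite grid, not a proof. The closed form used for $\iota^\prime$ is obtained here by differentiating $\iota(x) = -\log(e^{e^{-x}}-1)$ and is not quoted from the text, and the grid range $[-3, 10]$ is an arbitrary choice.

```python
import math

LOG_LOG_2 = math.log(math.log(2.0))


def psi(x):
    """psi(x) = -log(log(1 + e^{-x}))."""
    return -math.log(math.log1p(math.exp(-x)))


def iota(x):
    """iota(x) = psi^{-1}(x) = -log(exp(e^{-x}) - 1)."""
    return -math.log(math.expm1(math.exp(-x)))


def iota_prime(x):
    """Derivative of iota: u * e^u / (e^u - 1) with u = e^{-x}."""
    u = math.exp(-x)
    return u * math.exp(u) / math.expm1(u)


def psi_prime_times_x(x):
    """psi'(x) * x = x / ((1 + e^x) * log(1 + e^{-x}))."""
    return x / ((1.0 + math.exp(x)) * math.log1p(math.exp(-x)))


grid = [k / 10.0 for k in range(-30, 101)]  # x in [-3, 10], step 0.1

# Largest deviation of iota(psi(x)) from x: checks that iota inverts psi.
inverse_error = max(abs(iota(psi(x)) - x) for x in grid)

# Smallest value of iota(x)/iota'(x) - (x + log log 2): the lemma says it is >= 0.
smallest_gap = min(iota(x) / iota_prime(x) - (x + LOG_LOG_2) for x in grid)

print("inverse error:", inverse_error)
print("smallest gap:", smallest_gap)
```

The gap is exactly zero at $x = -\log\log 2$, where $\iota(x) = 0$, so the constant $\log\log 2$ in the lemma cannot be improved.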
