Criteria and Bias of Parameterized Linear Regression under Edge of Stability Regime

Peiyuan Zhang, Amin Karbasi

TL;DR

This work shows that Edge of Stability (EoS) can occur in gradient descent with a large step-size even when the loss is quadratic, in a regression task with quadratic parameterization $\beta_w = w_+^2 - w_-^2$. It combines empirical evidence and a rigorous one-sample analysis ($d=2$) to prove that GD converges to a linear interpolator $\beta_\infty$ within the EoS regime, with distinct behavior depending on whether $\eta\mu < 1$ or $\eta\mu > 1$, and with bounds on generalization error relative to sparsity priors. The study connects EoS phenomena to implicit bias in depth-2 diagonal linear networks and extends insights to multi-sample, overparameterized settings, showing that overparameterization ($d \geq n$) is necessary for EoS in the quadratic-loss diagonal-linear-net setting. Overall, the paper broadens the understanding of EoS by showing that subquadratic loss is not a strict prerequisite and by detailing the phase-transition dynamics and convergence properties under large GD step-sizes. The results have implications for the design and analysis of optimization in overparameterized linear-model regimes and for interpreting implicit bias in practical neural-network-like architectures.

Abstract

Classical optimization theory requires a small step-size for gradient-based methods to converge. Nevertheless, recent findings challenge this traditional view by empirically demonstrating that Gradient Descent (GD) converges even when the step-size $\eta$ exceeds the threshold of $2/L$, where $L$ is the global smoothness constant. This is usually known as the Edge of Stability (EoS) phenomenon. A widely held belief suggests that an objective function with subquadratic growth plays an important role in incurring EoS. In this paper, we provide a more comprehensive answer by considering the task of finding a linear interpolator $\beta \in \mathbb{R}^{d}$ for regression with loss function $l(\cdot)$, where $\beta$ admits the parameterization $\beta = w^2_{+} - w^2_{-}$. Contrary to previous work suggesting that a subquadratic $l$ is necessary for EoS, our novel finding reveals that EoS occurs even when $l$ is quadratic under proper conditions. This argument is made rigorous by both empirical and theoretical evidence, demonstrating that the GD trajectory converges to a linear interpolator in a non-asymptotic way. Moreover, the model under quadratic $l$, also known as a depth-$2$ diagonal linear network, remains largely unexplored under the EoS regime. Our analysis thus sheds new light on the implicit bias of diagonal linear networks when a larger step-size is employed, enriching the understanding of EoS on more practical models.
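
For intuition about the $2/L$ threshold, consider a standard one-dimensional worked example (an illustration, not taken from the paper): gradient descent on $f(w) = \tfrac{L}{2} w^2$, where $f$ and the scalar $w$ are symbols introduced only for this sketch.

$$
w_{t+1} = w_t - \eta f'(w_t) = (1 - \eta L)\, w_t
\quad\Longrightarrow\quad
w_t = (1 - \eta L)^t\, w_0 .
$$

The iterates contract to the minimizer $w = 0$ exactly when $|1 - \eta L| < 1$, that is, $0 < \eta < 2/L$, and they diverge for $\eta > 2/L$ whenever $w_0 \neq 0$. EoS describes runs in which GD still converges although this classical bound is violated.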

Paper Structure

This paper contains 33 sections, 21 theorems, 148 equations, 9 figures.

Key Result

Theorem 1

Suppose Assumption (asmp: one-sample) holds and the sign change $r_t r_{t+1} < 0$ occurs for every $t$ larger than some integer $t_0$. Let $\eta \mu \in (0, 1)$ and $\alpha^2 \leq O(1)$. Then the GD iteration in (eq: GD) converges at a linear rate to the limit ${\bm \beta}_{\infty}$ [displayed rate bound missing]. Moreover, $\| {\bm \beta}_{\infty} - {\bm \beta}^* \| \leq O(\alpha^{C_2})$, where $C_1, C_2>0$ are constants.

Figures (9)

  • Figure 1: Comparison between the EoS and gradient flow (GF) regimes, represented by blue and red lines, under parameterized linear regression in (eq: erm) with $l(a) = a^2/4$. From left to right, the plots show the trajectory of the regression weight ${\bm \beta}_{{\bm{w}}_t}$ (star and triangle mark the stable points), the decrease of the objective, and $\eta S_t$, where $S_t$ is the sharpness at iteration $t$. EoS is characterized by $\eta S_t > 2$. Unlike previous assertions, we observe that EoS also occurs with quadratic $l(a) = a^2/4$. Remaining parameters: ${\bm{x}} = (1, 0.5)$, $y=1$ and $\alpha = 0.01$.
  • Figure 2: Empirical verification of Claim (claim: main-claim). In the left two columns, the configurations satisfy the claim, and EoS occurs as the step-size increases. In contrast, we set $d = 1$ in the third column and $y = 0$ in the fourth column; under these settings GD diverges without triggering EoS as the step-size increases. Note that the last column ($y = 0$) uses a modified initialization ${\bm{w}}_{0,+}=2\alpha{\bm{1}}$, ${\bm{w}}_{0,-}=\alpha{\bm{1}}$, since otherwise $r_t = 0$ for every $t$ under the original initialization.
  • Figure 3: Influence of $\eta$ and the different asymptotic behaviors of $r_t$ along the GD trajectory. As the step-size increases, the plots show, from left to right, the GF regime, different subregimes of EoS, chaos, and divergence. In particular, when $x$ exceeds some threshold (see Theorem (thm: large-ss) for details), GD does not converge when $\mu\eta > 1$. Parameter configuration: $\mu = 1$, $\alpha=0.01$.
  • Figure 4: $\alpha$ determines the length of the intermediate phase when $\eta\mu > 1$: the gap between the start of oscillation $t_0$ and the start of convergence $\mathfrak{t}$ is proportional to $\log(1/\alpha)$. This is because, in the intermediate phase, $r_t$ remains roughly constant, which causes $b_t$ to increase almost linearly from the scale of $\alpha^{\Theta(1)}$ to $O(1)$. We use $x = 0.5$, $\eta = 1.1$ and $\mu =1$.
  • Figure 5: Relationship between the error $\|{\bm \beta}_{\infty} - {\bm \beta}^*\|$, $\alpha$ and $\eta$. In both plots, the $x$-axis is $\alpha$ and the $y$-axis is the error. The left plot shows the error for $\mu\eta > 1$ and the right plot for $\mu\eta < 1$. Remaining parameters: $x = 0.5$, $\mu = 1$.
  • ...and 4 more figures
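
The quantities plotted in Figure 1 (the residual $r_t$ and $\eta S_t$) can be explored with a small simulation. The following is a minimal sketch, not the authors' code. It assumes the one-sample objective has the form $L({\bm w}) = l(\langle {\bm x}, {\bm w}_+^2 - {\bm w}_-^2\rangle - y)$ with $l(a) = a^2/4$, and that the original initialization is ${\bm w}_{0,+} = {\bm w}_{0,-} = \alpha{\bm 1}$ (inferred from the remark in the Figure 2 caption that $r_t = 0$ when $y = 0$). The function names `sharpness` and `run_gd` and the step-sizes in the demo are hypothetical choices for illustration.

```python
import numpy as np


def sharpness(w_plus, w_minus, x, y):
    """Largest Hessian eigenvalue of L(w) = (<x, w_+^2 - w_-^2> - y)^2 / 4 (assumed objective)."""
    r = x @ (w_plus**2 - w_minus**2) - y
    # Gradient of the residual r with respect to the stacked vector (w_+, w_-).
    g = np.concatenate([2 * x * w_plus, -2 * x * w_minus])
    # Hessian of L = (1/2) g g^T + (r/2) * Hess(r), where Hess(r) = diag(2x, -2x).
    hess = 0.5 * np.outer(g, g) + r * np.diag(np.concatenate([x, -x]))
    # eigvalsh returns eigenvalues in ascending order, so the last one is the largest.
    return np.linalg.eigvalsh(hess)[-1]


def run_gd(x, y, alpha, eta, steps):
    """Run GD from w_+ = w_- = alpha * 1 (assumed initialization); track r_t and eta * S_t."""
    w_plus = alpha * np.ones_like(x)
    w_minus = alpha * np.ones_like(x)
    residuals, eta_sharpness = [], []
    for _ in range(steps):
        r = x @ (w_plus**2 - w_minus**2) - y
        residuals.append(r)
        eta_sharpness.append(eta * sharpness(w_plus, w_minus, x, y))
        # dL/dw_+ = r * x * w_+ and dL/dw_- = -r * x * w_-; both blocks are updated simultaneously.
        w_plus, w_minus = (w_plus - eta * r * x * w_plus,
                           w_minus + eta * r * x * w_minus)
    return np.array(residuals), np.array(eta_sharpness), w_plus, w_minus


if __name__ == "__main__":
    x = np.array([1.0, 0.5])  # sample, y and alpha as listed in the Figure 1 caption
    for eta in (0.1, 0.9):    # illustrative step-sizes, not taken from the paper
        res, eta_s, _, _ = run_gd(x, y=1.0, alpha=0.01, eta=eta, steps=500)
        sign_flips = int(np.sum(res[:-1] * res[1:] < 0))
        print(f"eta={eta}: final |r|={abs(res[-1]):.2e}, "
              f"max eta*S_t={eta_s.max():.3f}, sign flips of r_t={sign_flips}")
```

Raising `eta` and watching whether `eta * S_t` crosses 2 while the residual still shrinks gives a rough way to probe the regimes described in the captions. Which step-sizes fall into which regime depends on the exact objective and initialization, which are assumptions here.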

Theorems & Definitions (42)

  • Claim 1
  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Lemma 1
  • Lemma 2
  • Lemma 3: Informal
  • Lemma 4
  • proof
  • Lemma 5
  • ...and 32 more