Criteria and Bias of Parameterized Linear Regression under Edge of Stability Regime
Peiyuan Zhang, Amin Karbasi
TL;DR
This work shows that Edge of Stability (EoS) can occur in gradient descent (GD) with a large step-size even when the loss is quadratic, in a regression task with the quadratic parameterization β_w = w_+^2 - w_-^2. It combines empirical evidence with a rigorous one-sample analysis (d=2), which proves that GD converges to a linear interpolator β_∞ within the EoS regime, with distinct behavior depending on whether ημ < 1 or ημ > 1, and with bounds on the generalization error relative to sparsity priors. The study connects EoS phenomena to implicit bias in depth-2 diagonal linear networks and extends these insights to multi-sample, overparameterized settings, showing that overparameterization (d ≥ n) is necessary for EoS in the quadratic-loss diagonal-linear-net setting. Overall, the paper broadens the understanding of EoS by showing that a subquadratic loss is not a strict prerequisite and by detailing the phase-transition dynamics and convergence properties under large GD step-sizes. The results have implications for the design and analysis of optimization in overparameterized linear-model regimes and for interpreting implicit bias in practical neural-network-like architectures.
Abstract
Classical optimization theory requires a small step-size for gradient-based methods to converge. Nevertheless, recent findings challenge this view by empirically demonstrating that Gradient Descent (GD) converges even when the step-size $η$ exceeds the threshold of $2/L$, where $L$ is the global smoothness constant. This is known as the Edge of Stability (EoS) phenomenon. A widely held belief suggests that an objective function with subquadratic growth plays an important role in inducing EoS. In this paper, we provide a more comprehensive answer by considering the task of finding a linear interpolator $β \in \mathbb{R}^{d}$ for regression with loss function $l(\cdot)$, where $β$ admits the parameterization $β = w^2_{+} - w^2_{-}$. Contrary to previous work suggesting that a subquadratic $l$ is necessary for EoS, our novel finding reveals that EoS occurs even when $l$ is quadratic under proper conditions. This claim is supported by both empirical and theoretical evidence, demonstrating that the GD trajectory converges to a linear interpolator in a non-asymptotic way. Moreover, the model under quadratic $l$, also known as a depth-$2$ diagonal linear network, remains largely unexplored under the EoS regime. Our analysis therefore sheds new light on the implicit bias of diagonal linear networks when a larger step-size is employed, enriching the understanding of EoS on more practical models.
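
To make the setup concrete, the following is a minimal sketch of GD on the parameterized regression problem with a single sample and $d = 2$. It is an illustration, not the paper's experimental code. It assumes the loss $\frac{1}{2}(\langle x, β\rangle - y)^2$ with $β = w_+^2 - w_-^2$ taken elementwise. The data `x`, `y`, the initialization, the step-size `eta`, and the helper name `run_gd` are all hypothetical choices. The sketch also tracks the sharpness (largest Hessian eigenvalue) so that the product $η \cdot \text{sharpness}$ can be compared against the classical threshold of 2.

```python
import numpy as np


def run_gd(x, y, w_plus, w_minus, eta, steps):
    """GD on L(w) = 0.5 * (<x, w_plus**2 - w_minus**2> - y)**2 (one sample).

    Returns the final beta, the loss at every step, and the sharpness
    (largest eigenvalue of the Hessian of L) at every step.
    """
    w_plus = np.asarray(w_plus, dtype=float).copy()
    w_minus = np.asarray(w_minus, dtype=float).copy()
    losses, sharpness = [], []
    for _ in range(steps):
        r = x @ (w_plus**2 - w_minus**2) - y  # residual
        losses.append(0.5 * r**2)

        # Gradient of the residual with respect to (w_plus, w_minus).
        g = np.concatenate([2 * x * w_plus, -2 * x * w_minus])
        # Hessian of L: outer(g, g) + r * (Hessian of the residual).
        hessian = np.outer(g, g) + r * np.diag(np.concatenate([2 * x, -2 * x]))
        sharpness.append(np.linalg.eigvalsh(hessian)[-1])

        # Simultaneous GD update; the gradient of L is r * g.
        w_plus, w_minus = (
            w_plus - eta * r * 2 * x * w_plus,
            w_minus + eta * r * 2 * x * w_minus,
        )
    beta = w_plus**2 - w_minus**2
    return beta, np.array(losses), np.array(sharpness)


# Hypothetical data and hyperparameters, chosen only for illustration.
x = np.array([1.0, 0.5])
y = 1.0
eta = 0.01  # deliberately small, so this run stays in the classical regime
beta, losses, sharpness = run_gd(
    x, y, w_plus=[0.1, 0.1], w_minus=[0.1, 0.1], eta=eta, steps=5000
)
print("final loss:", losses[-1])
print("eta * final sharpness:", eta * sharpness[-1])
print("beta:", beta, "<x, beta> =", x @ beta)
```

With the small step-size above, the run stays below the classical stability threshold and settles on an interpolator. Increasing `eta` until `eta * sharpness` exceeds 2 is one way to probe the large step-size regime the abstract describes. Whether such a run converges depends on the data and initialization, which this sketch does not attempt to characterize.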
