Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes
Ruiqi Zhang, Jingfeng Wu, Licong Lin, Peter L. Bartlett
TL;DR
This work analyzes gradient descent (GD) with large, risk-adaptive stepsizes for logistic regression on linearly separable data with margin γ. After a burn-in of at most 1/γ^2 steps, GD reaches a risk of at most exp(-Θ(η)), where η is a constant stepsize hyperparameter; because η can be taken arbitrarily large, the risk can be made arbitrarily small right after the burn-in, although the risk need not decrease monotonically along the way, as in the edge-of-stability (EoS) regime. The paper also proves a matching lower bound: on hard datasets with margin γ, any batch or online first-order method needs Ω(1/γ^2) steps to find a linear separator, so GD with large adaptive stepsizes is minimax optimal among first-order batch methods. The GD analysis extends to a broad class of losses and to certain two-layer networks. Using a transformed-objective analysis and split-optimization techniques, the paper shows how large adaptive steps accelerate convergence, clarifies the trade-offs with descent-lemma-based methods, and places the results within the literature on EoS and aggressive stepsize strategies.
Abstract
We study $\textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $η$. We show that after at most $1/γ^2$ burn-in steps, GD achieves a risk upper bounded by $\exp(-Θ(η))$, where $γ$ is the margin of the dataset. As $η$ can be arbitrarily large, GD attains an arbitrarily small risk $\textit{immediately after the burn-in steps}$, though the risk evolution may be $\textit{non-monotonic}$. We further construct hard datasets with margin $γ$, where any batch (or online) first-order method requires $Ω(1/γ^2)$ steps to find a linear separator. Thus, GD with large, adaptive stepsizes is $\textit{minimax optimal}$ among first-order batch methods. Notably, the classical $\textit{Perceptron}$ (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/γ^2$, matching GD even in constants. Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.
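As a rough illustration of the setup in the abstract, the following Python sketch runs GD on the empirical logistic risk with a stepsize that adapts to the current risk. The specific rule $η_t = η / L(w_t)$ is an assumption made here for illustration; the abstract only states that the stepsizes adapt to the current risk and are scaled by $η$. The function names and the four-point toy dataset are likewise hypothetical and not taken from the paper.

```python
import numpy as np


def logistic_risk(w, X, y):
    """Empirical logistic risk: mean of log(1 + exp(-y_i * <w, x_i>))."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))


def logistic_grad(w, X, y):
    """Gradient of the empirical logistic risk with respect to w."""
    margins = y * (X @ w)
    # Derivative of log(1 + exp(-z)) is -1 / (1 + exp(z)), computed stably.
    coeff = -np.exp(-np.logaddexp(0.0, margins))
    return ((coeff * y)[:, None] * X).mean(axis=0)


def adaptive_gd(X, y, eta, steps):
    """GD with stepsize eta / risk(w_t).

    ASSUMPTION: this normalization is one simple way to make the stepsize
    adapt to the current risk; it may differ from the paper's exact rule.
    """
    w = np.zeros(X.shape[1])
    risks = []
    for _ in range(steps):
        risk = logistic_risk(w, X, y)
        risks.append(risk)
        if risk == 0.0:  # risk underflowed to zero; stepsize is undefined
            break
        w = w - (eta / risk) * logistic_grad(w, X, y)
    risks.append(logistic_risk(w, X, y))
    return w, risks


# Hypothetical toy dataset: four unit-norm points separable by w* = (1, 0).
X_demo = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
X_demo /= np.linalg.norm(X_demo, axis=1, keepdims=True)
y_demo = np.array([1.0, 1.0, -1.0, -1.0])

w_demo, risks_demo = adaptive_gd(X_demo, y_demo, eta=5.0, steps=10)
print("risk per step:", risks_demo)
print("separates data:", bool(np.all(y_demo * (X_demo @ w_demo) > 0)))
```

On this symmetric toy problem the risk happens to decrease at every step; on less regular data the recorded risks may oscillate, which is why the sketch keeps the whole trajectory rather than only the final value.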
