Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes

Ruiqi Zhang, Jingfeng Wu, Licong Lin, Peter L. Bartlett

TL;DR

This work analyzes gradient descent (GD) with large, risk-adaptive stepsizes for logistic regression on linearly separable data. After a burn-in of at most $1/γ^2$ steps, where $γ$ is the margin of the dataset, GD attains a risk of at most $\exp(-Θ(η))$, so the risk can be made arbitrarily small by increasing the stepsize hyperparameter $η$. The risk need not decrease monotonically along the way, which places the method in the edge-of-stability (EoS) regime rather than the regime covered by the descent lemma. The paper also proves a minimax lower bound: on hard datasets with margin $γ$, any batch or online first-order method needs $Ω(1/γ^2)$ steps to find a linear separator, so the proposed method is minimax optimal among first-order batch methods up to constants. The analysis extends to a broad class of losses and to certain two-layer networks. Using a transformed-objective analysis and split-optimization techniques, the paper shows how large adaptive steps accelerate convergence, clarifies the trade-offs with descent-lemma-based methods, and situates the results within the literature on EoS and aggressive stepsize strategies.

Abstract

We study gradient descent (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $η$. We show that after at most $1/γ^2$ burn-in steps, GD achieves a risk upper bounded by $\exp(-Θ(η))$, where $γ$ is the margin of the dataset. As $η$ can be arbitrarily large, GD attains an arbitrarily small risk immediately after the burn-in steps, though the risk evolution may be non-monotonic. We further construct hard datasets with margin $γ$, where any batch (or online) first-order method requires $Ω(1/γ^2)$ steps to find a linear separator. Thus, GD with large, adaptive stepsizes is minimax optimal among first-order batch methods. Notably, the classical Perceptron (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/γ^2$, matching GD even in constants. Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.
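
The update described in the abstract can be illustrated with a minimal sketch. The exact stepsize scheduler is not reproduced in this summary, so the code below assumes the simple rule $η_t = η / L(\mathbf{w}_t)$ as one way to make the stepsize "adapt to the current risk"; this rule, the function names, and the toy dataset are illustrative assumptions, not the paper's definitions.

```python
import math


def logistic_risk_and_grad(w, X, y):
    """Return the empirical logistic risk L(w) and its gradient."""
    n, d = len(X), len(w)
    risk, grad = 0.0, [0.0] * d
    for x_i, y_i in zip(X, y):
        m = y_i * sum(wj * xj for wj, xj in zip(w, x_i))  # margin y_i <w, x_i>
        # log(1 + exp(-m)), computed stably for either sign of m
        risk += max(-m, 0.0) + math.log1p(math.exp(-abs(m)))
        # sigmoid(-m) = 1 / (1 + exp(m)), computed stably
        s = math.exp(-m) / (1.0 + math.exp(-m)) if m >= 0 else 1.0 / (1.0 + math.exp(m))
        for j in range(d):
            grad[j] -= s * y_i * x_i[j] / n
    return risk / n, grad


def adaptive_gd(X, y, eta, num_steps):
    """GD from w_0 = 0 with a risk-adaptive stepsize.

    ASSUMPTION: the stepsize is eta / L(w_t); the paper's actual
    scheduler may differ from this illustrative choice.
    """
    w = [0.0] * len(X[0])
    risk, grad = logistic_risk_and_grad(w, X, y)
    risks = [risk]
    for _ in range(num_steps):
        if risk == 0.0:  # risk underflowed to zero; nothing left to adapt to
            break
        step = eta / risk
        w = [wj - step * gj for wj, gj in zip(w, grad)]
        risk, grad = logistic_risk_and_grad(w, X, y)
        risks.append(risk)
    return w, risks


# Hypothetical toy dataset: two linearly separable points in the plane.
X_toy = [(1.0, 0.5), (-1.0, 0.5)]
y_toy = [1, -1]
w_final, risk_path = adaptive_gd(X_toy, y_toy, eta=5.0, num_steps=10)
print("final weights:", w_final)
print("risk path:", risk_path)
```

On this toy problem the risk happens to decrease at every step; the abstract's point is that in general the risk path may be non-monotonic, so inspecting `risk_path` rather than assuming descent is the safer habit.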

Paper Structure

This paper contains 38 sections, 17 theorems, 92 equations, 2 tables.

Key Result

Theorem 2.1

Consider GD with the adaptive stepsizes for logistic regression under the paper's data assumption (linearly separable data with margin $\gamma$). Assume without loss of generality that $\mathbf{w}_0=\mathbf{0}$. Then for every $t\ge 1$ and $\eta > 0$, an explicit upper bound on the risk holds (the displayed inequality is missing from this summary). In particular, after $1/\gamma^2$ burn-in steps, for every $\eta > 0$, the risk is at most $\exp(-\Theta(\eta))$, as stated in the abstract.

Theorems & Definitions (33)

  • Theorem 2.1: GD with large and adaptive stepsizes
  • Theorem 2.2: A lower bound for GD in the stable regime
  • Proposition 2.3: Corollary 2 in wu2024large
  • Proposition 2.4: Consequences of Theorem 2.2 in ji2021characterizing
  • Proposition 2.5: Consequences of Lemmas C.7 and C.12 in ji2021fast
  • Proof: proof of the theorem labeled thm.LR.maintext
  • Definition 3.1: First-order batch methods
  • Theorem 3.2: A lower bound for first-order batch methods
  • Definition 3.3: First-order online methods
  • Theorem 3.4: Lower bounds for online first-order methods
  • ...and 23 more
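
The classical Perceptron, which the abstract cites as a first-order online method with step complexity $1/γ^2$, can be sketched as follows. This is a minimal textbook version rather than code from the paper; the function name, the `max_passes` cap, and the toy dataset (unit-norm inputs with margin $γ = 0.6$) are assumptions made for illustration.

```python
def perceptron(X, y, max_passes=1000):
    """Classical Perceptron: update w only on misclassified examples.

    Returns the final weights and the number of updates (mistakes).
    max_passes is an illustrative safety cap, not part of the algorithm.
    """
    w = [0.0] * len(X[0])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x_i, y_i in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, x_i))
            if y_i * score <= 0:  # mistake (or zero margin): take a step
                w = [wj + y_i * xj for wj, xj in zip(w, x_i)]
                mistakes += 1
                clean_pass = False
        if clean_pass:  # every example is classified with positive margin
            break
    return w, mistakes


# Hypothetical toy dataset: unit-norm points, separable by w* = (1, 0)
# with margin gamma = 0.6.
X_toy = [(0.6, 0.8), (0.6, -0.8), (-0.6, 0.8), (-0.6, -0.8)]
y_toy = [1, 1, -1, -1]
gamma = 0.6

w_perc, num_mistakes = perceptron(X_toy, y_toy)
print("weights:", w_perc)
print("mistakes:", num_mistakes, "bound 1/gamma^2:", 1.0 / gamma**2)
```

For inputs with norm at most 1, the mistake bound of Novikoff (1962) guarantees at most $1/γ^2$ updates, which is the quantity the lower bounds in the list above are compared against.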