Any-stepsize Gradient Descent for Separable Data under Fenchel-Young Losses

Han Bao, Shinsaku Sakaue, Yuki Takezawa

TL;DR

This work analyzes gradient descent with arbitrary stepsize for linearly separable data under Fenchel–Young losses, showing convergence to an ε-optimal loss without requiring the self-bounding property when the loss has a separation margin. The key mechanism is a perceptron-style argument that tracks alignment with the separating direction, yielding rates T = Ω(ε^{-α}) where α depends on the potential φ generating the Fenchel–Young loss; in concrete instances, Tsallis entropy achieves Ω(ε^{-1/2}) and Rényi entropy achieves Ω(ε^{-1/3}) under margin. By instantiating with Shannon, Tsallis, and Rényi entropies, the paper demonstrates how the separation margin drives faster convergence than classical GD in the stable regime, while the logistic (Shannon) case recovers the standard rate when margin is absent. The results illuminate the role of margin over self-bounding for large-step GD, discuss stochastic extensions and implicit-bias implications, and lay out open problems including data-dependence and finite-time behavior.

Abstract

Gradient descent (GD) has been one of the most common optimizers in machine learning. In particular, the loss landscape of a neural network is typically sharpened during the initial phase of training, making the training dynamics hover on the edge of stability. This is beyond our standard understanding of GD convergence in the stable regime, where an arbitrarily chosen stepsize is sufficiently smaller than the edge of stability. Recently, Wu et al. (COLT 2024) have shown that GD converges with arbitrary stepsize under linearly separable logistic regression. Although their analysis hinges on the self-bounding property of the logistic loss, which seems to be a cornerstone for establishing a modified descent lemma, our pilot study shows that other loss functions without the self-bounding property can make GD converge with arbitrary stepsize. To further understand what property of a loss function matters in GD, we aim to show arbitrary-stepsize GD convergence for a general loss function based on the framework of Fenchel–Young losses. We essentially leverage the classical perceptron argument to derive the convergence rate for achieving $ε$-optimal loss, which is possible for a majority of Fenchel–Young losses. Among typical loss functions, the Tsallis entropy achieves the GD convergence rate $T=Ω(ε^{-1/2})$, and the Rényi entropy achieves the far better rate $T=Ω(ε^{-1/3})$. We argue that these better rates are possible because of the separation margin of loss functions, instead of the self-bounding property.
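
As background for the "classical perceptron argument" mentioned above, the following is a minimal sketch of the textbook perceptron together with Novikoff's mistake bound $(R/\gamma)^2$. It is not the paper's proof. The four points, the reference direction `u = (0, 1)`, and the helper names `perceptron` and `novikoff_bound` are assumptions chosen for illustration.

```python
# Illustrative linearly separable points: ((x1, x2), label in {+1, -1}).
POINTS = [
    ((1.0, 0.2), 1),
    ((-2.0, 0.2), 1),
    ((-1.0, -0.2), -1),
    ((2.0, -0.2), -1),
]


def perceptron(points, max_epochs=1000):
    """Classical perceptron started at w = 0; returns (w, number of mistakes)."""
    w = [0.0, 0.0]
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for (x1, x2), y in points:
            # A mistake is a non-positive margin y * <w, x>.
            if y * (w[0] * x1 + w[1] * x2) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                mistakes += 1
                clean_pass = False
        if clean_pass:  # every point is classified with positive margin
            break
    return w, mistakes


def novikoff_bound(points, u):
    """Mistake bound (R / gamma)^2 for a unit-norm separating direction u."""
    r_squared = max(x1 * x1 + x2 * x2 for (x1, x2), _ in points)
    gamma = min(y * (u[0] * x1 + u[1] * x2) for (x1, x2), y in points)
    return r_squared / (gamma * gamma)


if __name__ == "__main__":
    w, mistakes = perceptron(POINTS)
    print("final w:", w)
    print("mistakes:", mistakes, "bound:", novikoff_bound(POINTS, (0.0, 1.0)))
```

The bound follows from tracking two quantities per mistake: the alignment $\langle \mathbf{w}, \mathbf{u} \rangle$ grows by at least $\gamma$, while $\|\mathbf{w}\|^2$ grows by at most $R^2$. Combining the two through the Cauchy–Schwarz inequality caps the number of mistakes.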

Paper Structure

This paper contains 31 sections, 26 theorems, 183 equations, 3 figures, 1 table.

Key Result

Theorem 1

Consider a binary classification dataset that is linearly separable. We run GD with an arbitrary constant stepsize $\eta>0$ and initialization $\mathbf{w}_0=\mathbf{0}$ under a Fenchel–Young loss generated by a twice continuously differentiable and convex potential $\phi$ with separation margin we have $L(\mathbf{w}_T)\le\varepsilon$. Throughout this paper, we consider $\eta=\Theta(1)$ with r

Figures (3)

  • Figure 1: Pilot studies of GD with the same toy dataset as Wu et al. (2024). The dataset consists of four points, $\mathbf{x}_1=[1,0.2]^\top$, $y_1=1$, $\mathbf{x}_2=[-2,0.2]^\top$, $y_2=1$, $\mathbf{x}_3=[-1,-0.2]^\top$, $y_3=-1$, $\mathbf{x}_4=[2,-0.2]^\top$, and $y_4=-1$. GD is run with initialization $\mathbf{w}_0=[0,0]^\top$. Note that the logistic loss corresponds to the Tsallis $1$-loss. The Tsallis $2$- and $q$-losses are also known as the modified Huber loss (Zhang, 2004) and the $q$-entmax loss (Peters et al., 2019), respectively.
  • Figure 2: Under the same setup as Figure 1, we show $\|\mathbf{w}_t\|$ against the number of steps $t$ for different losses.
  • Figure 3: For the pseudo-spherical entropy, $\alpha(\mu)=[\phi'(\mu)/\mu\phi''(\mu)] \cdot [1-\phi(\mu)/\mu\phi'(\mu)]$ is shown.
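
The pilot setup in the Figure 1 caption can be sketched as follows for the logistic loss only, which the caption identifies with the Tsallis $1$-loss. This is a minimal illustration under stated assumptions, not the authors' experiment code: the stepsize `eta=10.0`, the step count, and the helper names are hypothetical choices, and the other Tsallis losses are not implemented.

```python
import math

# Toy dataset from the Figure 1 caption: ((x1, x2), label in {+1, -1}).
DATA = [
    ((1.0, 0.2), 1),
    ((-2.0, 0.2), 1),
    ((-1.0, -0.2), -1),
    ((2.0, -0.2), -1),
]


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Numerically stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(w):
    """Average of log(1 + exp(-y * <w, x>)) over the dataset."""
    total = 0.0
    for (x1, x2), y in DATA:
        margin = y * (w[0] * x1 + w[1] * x2)
        total += softplus(-margin)
    return total / len(DATA)


def gradient(w):
    """Gradient of the average logistic loss with respect to w."""
    g0 = g1 = 0.0
    for (x1, x2), y in DATA:
        margin = y * (w[0] * x1 + w[1] * x2)
        coeff = -sigmoid(-margin) * y  # derivative of softplus(-margin) w.r.t. <w, x>
        g0 += coeff * x1
        g1 += coeff * x2
    n = len(DATA)
    return (g0 / n, g1 / n)


def run_gd(eta, steps):
    """Constant-stepsize GD from w_0 = 0; returns the final w and the loss history."""
    w = (0.0, 0.0)
    losses = [logistic_loss(w)]
    for _ in range(steps):
        g = gradient(w)
        w = (w[0] - eta * g[0], w[1] - eta * g[1])
        losses.append(logistic_loss(w))
    return w, losses


if __name__ == "__main__":
    w, losses = run_gd(eta=10.0, steps=1000)
    print("initial loss:", losses[0])
    print("loss after one step:", losses[1])
    print("final loss:", losses[-1])
    print("final ||w||:", math.hypot(w[0], w[1]))
```

With a stepsize this large the loss is not monotone: the first step already raises it above its initial value $\log 2$, which is the kind of behaviour outside the stable regime that the abstract describes. Logging `math.hypot(*w)` at every step would give a curve comparable in spirit to Figure 2.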

Theorems & Definitions (48)

  • Theorem 1: Informal version of Theorem 5
  • Definition 2
  • Definition 3
  • Proposition 4: Blondel et al. (2020)
  • Theorem 5: Main result
  • Corollary 6
  • Lemma 7
  • proof
  • Lemma 8
  • proof
  • ...and 38 more