Convergence of Gradient Descent on Separable Data

Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry

TL;DR

The paper analyzes the implicit bias of gradient descent on separable data for a broad class of strictly monotone losses. It shows that losses with super-polynomial tails drive gradient descent, including on deep linear networks, toward the $L_2$ maximum-margin separator, and it characterizes the margin convergence rates, establishing that exponential tails achieve the best fixed-step rate. It further demonstrates that normalized gradient descent or variable step sizes can accelerate margin convergence to $O(\log t/\sqrt{t})$ for exponential losses, with empirical evidence across synthetic and image classification tasks. The results illuminate why exponential-tailed losses, such as the logistic loss, are effective and suggest practical acceleration strategies that may extend to more complex neural networks, while highlighting the role of tail behavior and depth in the implicit bias of optimization.

Abstract

We provide a detailed study of the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the conditions on the tail of the loss function under which gradient descent converges in the direction of the $L_2$ maximum-margin separator? (b) how does the rate of margin convergence depend on the tail of the loss function and the choice of the step size? We show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of the $L_2$ maximum-margin solution, while this does not hold for losses with heavier tails. Within this family, for simple linear models we show that the optimal rate with a fixed step size is indeed obtained for the commonly used exponentially tailed losses such as the logistic loss. However, with a fixed step size the optimal convergence rate is extremely slow, namely $1/\log(t)$, as also proved in Soudry et al. (2018). For linear models with exponential loss, we further prove that the convergence rate can be improved to $\log(t)/\sqrt{t}$ by using aggressive step sizes that compensate for the rapidly vanishing gradients. Numerical results suggest this method might be useful for deep networks.
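The contrast between a fixed step size and an aggressive, normalized step can be seen on a toy problem. The following is a minimal sketch, not the authors' code: the four-point dataset, the function names, the fixed step `eta=0.1`, and the $1/\sqrt{t+1}$ schedule for the normalized variant are all assumptions made for illustration. The data are chosen so that the $L_2$ maximum-margin vector (without a bias term) is $\hat{\mathbf{w}} = (0.5, 0.5)$, giving a maximum normalized margin of $\sqrt{2}$.

```python
import numpy as np

# Toy separable data (hypothetical, chosen so the max-margin vector is known).
X = np.array([[0.5, 1.5], [2.0, 0.0], [-1.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Z = y[:, None] * X            # rows are y_n * x_n
W_HAT = np.array([0.5, 0.5])  # satisfies min_n z_n . w_hat = 1 with minimal norm
MAX_MARGIN = 1.0 / np.linalg.norm(W_HAT)  # equals sqrt(2)


def normalized_margin(w):
    """Return min_n y_n x_n^T w / ||w||, the quantity whose convergence is studied."""
    return float(np.min(Z @ w) / np.linalg.norm(w))


def gd(steps, eta=0.1):
    """Fixed-step GD on the exponential loss sum_n exp(-y_n x_n^T w)."""
    w = np.zeros(2)
    for _ in range(steps):
        grad = -Z.T @ np.exp(-(Z @ w))
        w = w - eta * grad
    return w


def normalized_gd(steps):
    """GD along the unit-norm negative gradient with step 1/sqrt(t+1) (assumed schedule)."""
    w = np.zeros(2)
    for t in range(steps):
        m = Z @ w
        # Shifting by m.min() rescales the gradient without changing its
        # direction, which avoids underflow once the margins are large.
        direction = Z.T @ np.exp(-(m - m.min()))  # proportional to -grad
        w = w + direction / np.linalg.norm(direction) / np.sqrt(t + 1)
    return w


if __name__ == "__main__":
    for name, w in [("GD", gd(2000)), ("normalized GD", normalized_gd(2000))]:
        print(name, "margin gap:", MAX_MARGIN - normalized_margin(w))
```

Both runs separate the data, but after the same number of iterations the normalized variant should sit much closer to the maximum margin, because its iterate norm grows roughly like $\sqrt{t}$ rather than $\log t$.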

Paper Structure

This paper contains 55 sections, 27 theorems, 355 equations, 7 figures, 1 table.

Key Result

Theorem 1

For almost all linearly separable datasets $\{\mathbf{x}_n,y_n\}_{n=1}^N$, and any $\beta$-smooth $\mathcal{L}$ with a strictly monotone loss function $\ell$ (see the definition of the assumptions on $\ell(u)$), for which $-\ell^{\prime}$ has a tight exponential tail (see the definition of an exponential tail), the gradient [...] where the residual $\boldsymbol{\rho}\left(t\right)$ is bounded and $\hat{\mathbf{w}}$ is the follo[...]
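The statement above is cut off in this listing. For orientation only, the result it rephrases (Theorem 3 of Soudry et al., 2018) is usually written in the form below. This is a reconstruction from that reference, with the labels $y_n$ written explicitly as an assumption about notation; it is not the missing text itself.

$$
\mathbf{w}(t) = \hat{\mathbf{w}}\log t + \boldsymbol{\rho}(t), \qquad
\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \|\mathbf{w}\|^{2} \quad \text{s.t.} \quad y_n \mathbf{w}^{\top}\mathbf{x}_n \ge 1 \ \ \forall n .
$$

Under this form, the slow fixed-step rate quoted in the abstract follows in one step. Because $\boldsymbol{\rho}(t)$ is bounded,

$$
\frac{\mathbf{w}(t)}{\|\mathbf{w}(t)\|}
= \frac{\hat{\mathbf{w}} + \boldsymbol{\rho}(t)/\log t}{\left\|\hat{\mathbf{w}} + \boldsymbol{\rho}(t)/\log t\right\|}
= \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|} + O\!\left(\frac{1}{\log t}\right),
$$

so the normalized margin $\min_n y_n \mathbf{w}(t)^{\top}\mathbf{x}_n / \|\mathbf{w}(t)\|$ approaches the maximum margin $1/\|\hat{\mathbf{w}}\|$ with a gap of at most $O(1/\log t)$.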

Figures (7)

  • Figure 1: Visualization of the convergence of GD in comparison to normalized GD on a synthetic logistic regression dataset in which the $L_{2}$ maximum-margin vector $\hat{\mathbf{w}}$ is precisely known. (A) The dataset (positive and negative samples ($y=\pm1$) are denoted by $'+'$ and $'\circ'$, respectively), the max-margin separating hyperplane (black line), and the solutions of GD (dashed red) and normalized GD (dashed blue) after $10^5$ iterations. For both GD and normalized GD, we show: (B) the norm of $\mathbf{w}\left(t\right)$, normalized so that it equals $1$ at the last iteration, to facilitate comparison; (C) the training loss; and (D&E) the angle and margin gap of $\mathbf{w}\left(t\right)$ from $\hat{\mathbf{w}}$. As can be seen in panels (C-E), normalized GD converges to the maximum-margin separator significantly faster, as expected from our results. More details are given in the appendix.
  • Figure 2: Margin convergence plots for 2-layer (top) and 3-layer (bottom) linear networks on synthetic clustered data, trained with GD and normalized GD; the latter converges significantly faster.
  • Figure 3: MNIST digit classification with a 2-layer feedforward neural network. Training loss (dashed lines) stagnates with GD once gradients become small, while normalized GD keeps making progress. Normalized GD also achieves lower test error (solid lines).
  • Figure 4: Test performance of a Wide ResNet 28-4 on CIFAR-10, with $\eta = 2.0$, where normalized GD outperforms GD by an absolute $2.17\%$. We plot the 'best yet' test error: the lowest error seen up to iteration $t$. Unlike the curves reported in the original Wide ResNet paper, progress stops early in training: there is no change in the 'best yet' test error after $t=2350$, even with the decays in learning rate. This suggests that regularization and/or momentum might be required to achieve better results.
  • Figure 5: a) Visualization of the synthetic dataset composed of 600 points: 300 labeled positive and 300 negative, again denoted by $'+'$ and $'\circ'$, respectively. b) Convergence plots for logistic regression trained with GD and normalized GD for $5 \times 10^4$ epochs. Similarly to what is observed in Figure 1, normalized GD converges significantly faster to the max-margin solution.
  • ...and 2 more figures
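
The angle and margin-gap curves described for Figure 1 (panels D and E) can be computed from an iterate and a known maximum-margin vector. The sketch below is one plausible reading of those two metrics, not the authors' definitions: the function names are hypothetical, and the margin gap is assumed to be the difference between the normalized margins of $\hat{\mathbf{w}}$ and $\mathbf{w}(t)$.

```python
import numpy as np


def angle_gap(w, w_hat):
    """Angle in radians between w and the max-margin vector w_hat."""
    cos = np.dot(w, w_hat) / (np.linalg.norm(w) * np.linalg.norm(w_hat))
    # Clip to guard against round-off pushing the cosine outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def margin_gap(w, X, y, w_hat):
    """Normalized margin of w_hat minus that of w (assumed definition)."""
    def margin(v):
        return np.min(y * (X @ v)) / np.linalg.norm(v)
    return float(margin(w_hat) - margin(w))


if __name__ == "__main__":
    # Hypothetical four-point dataset whose max-margin vector is (0.5, 0.5).
    X = np.array([[0.5, 1.5], [2.0, 0.0], [-1.0, -2.0], [-3.0, -3.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w_hat = np.array([0.5, 0.5])
    w = np.array([1.0, 0.0])
    print(angle_gap(w, w_hat), margin_gap(w, X, y, w_hat))
```

Both quantities are invariant to rescaling $\mathbf{w}$, which is why they are suitable for comparing iterates whose norms grow at different rates.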

Theorems & Definitions (62)

  • Definition 1
  • Definition 2
  • Theorem 1: Theorem 3 in Soudry et al. (2018), rephrased
  • Remark 1
  • Theorem 2
  • Remark 2
  • Remark 3
  • Remark 4
  • Theorem 3
  • Corollary 1
  • ...and 52 more