
Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes

Si Yi Meng, Antonio Orvieto, Daniel Yiming Cao, Christopher De Sa

TL;DR

Although local convergence is guaranteed for all step sizes less than the critical step size, global convergence is not, and gradient descent (GD) may instead converge to a cycle depending on the initialization.

Abstract

We study gradient descent (GD) dynamics on logistic regression problems with large, constant step sizes. For linearly separable data, it is known that GD converges to the minimizer with arbitrarily large step sizes, a property which no longer holds when the problem is not separable. In fact, the behaviour can be much more complex -- a sequence of period-doubling bifurcations begins at the critical step size $2/λ$, where $λ$ is the largest eigenvalue of the Hessian at the solution. Using a smaller-than-critical step size guarantees convergence if GD is initialized near the solution, but does this suffice globally? In one dimension, we show that a step size less than $1/λ$ suffices for global convergence. However, for all step sizes between $1/λ$ and the critical step size $2/λ$, one can construct a dataset such that GD converges to a stable cycle. In higher dimensions, this is actually possible even for step sizes less than $1/λ$. Our results show that although local convergence is guaranteed for all step sizes less than the critical step size, global convergence is not, and GD may instead converge to a cycle depending on the initialization.
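
As a brief aside on why $2/λ$ is the critical step size, the standard linearization argument can be sketched as follows. It assumes the loss $\mathcal{L}$ is twice differentiable with a positive definite Hessian $H = \nabla^2 \mathcal{L}(w^*)$ at the minimizer $w^*$, whose eigenvalues are written $λ_i$ with $0 < λ_i \le λ$; this notation is introduced here for illustration and is not taken from the paper.

$$
w_{t+1} - w^* \;=\; (w_t - w^*) - \eta \nabla \mathcal{L}(w_t) \;\approx\; (I - \eta H)(w_t - w^*),
\qquad
|1 - \eta λ_i| < 1 \ \text{for all } i \iff 0 < \eta < \frac{2}{λ}.
$$

At $\eta = 2/λ$ the multiplier along the top eigendirection equals $-1$, which is the usual setting in which a period-doubling bifurcation can occur. The argument is purely local: it says nothing about iterates that start far from $w^*$, which is the gap the abstract refers to.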

Paper Structure

This paper contains 26 sections, 25 theorems, 130 equations, 10 figures, and 1 table.

Key Result

Corollary 1

assmpt:individual-loss implies that

Figures (10)

  • Figure 1: Bifurcation diagrams on two binary classification datasets from the LIBSVM repository (Chang and Lin, 2011), both of which are non-separable. For each step size, we run GD for $T= 5\cdot 10^5$ iterations with $1024$ different random initializations of varying scales. Each point corresponds to the loss (first row) or the (scaled) largest eigenvalue of the Hessian (second row) evaluated at the final iterate $w_T$. When multiple points are visible, GD has either converged to a cycle or is chaotic under that step size.
  • Figure 2: On the left is the bifurcation diagram on a synthetic dataset with $n=12$ and $d=4$. The $x_i$'s are generated from the standard Gaussian distribution, with uniformly random labels. In the middle, we plot the loss at each GD iteration when run with $\eta=35$, for two different initializations $w_0=100\cdot \mathbf{1}$ and $w_0=0.001\cdot \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector. On the right, we compute the power spectral density of the losses over $t$, which shows a period-$2$ cycle for the large initialization, while the run with the small initialization converged to a period-$8$ cycle.
  • Figure 3: The limit in Eq. \ref{eq:limit-is-relu}, using the logistic loss as an example. Darker grey represents smaller values of $\epsilon$.
  • Figure 4: Illustration of our cycle construction. On the left, $\mathcal{L}$ is the logistic loss with data consisting of $250$ copies of $x_i=1$ and $200$ copies of $x_i=-1$. On top of this dataset we add $15$ copies of $x_i=70$ to get $\mathcal{L}_\epsilon$ on the right. The red stars mark the minimizers $w^*$ and $w^*_\epsilon$ of the respective objectives. Starting at the same $w_0$ and using the same $\gamma=1.5$, GD on $\mathcal{L}$ with $\eta = \gamma/\mathcal{L}''(w^*)$ converges to the minimizer, while GD on $\mathcal{L}_\epsilon$ with $\eta_\epsilon = \gamma/\mathcal{L}_\epsilon''(w^*_\epsilon)$ converges to a period-$7$ cycle.
  • Figure 5: For each $\gamma$, we construct a one-dimensional dataset such that GD on this problem with the logistic loss converges to a stable cycle under the step size $\eta=\gamma/\mathcal{L}''(w^*)$. One exception is the last column, which seems to suggest that GD can even be chaotic when $\gamma<2$. Figures in the first row show the loss evaluated at successive iterates, while the second row shows the power spectral density of the losses over the last $1024$ iterations. Details on each dataset are in Section \ref{sec:exps}.
  • ...and 5 more figures
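
The construction described in Figure 4 can be explored numerically. The sketch below is a minimal illustration under stated assumptions, not the authors' code: it assumes each label is already absorbed into $x_i$, that the objective is the average of $\log(1+\exp(-x_i w))$ over the data, and it uses a hypothetical initialization `w0 = 5.0`, since the caption does not state $w_0$. The helper names (`minimizer`, `run_gd`, `detect_period`) are introduced here for illustration only.

```python
import numpy as np

def sigmoid(z):
    # Logistic function written with tanh to avoid overflow for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def grad(w, x, c):
    # Derivative of L(w) = sum_j c_j * log(1 + exp(-x_j * w)) / sum_j c_j,
    # where c_j is the number of copies of the data point x_j (assumed form).
    return -np.sum(c * x * sigmoid(-x * w)) / np.sum(c)

def hess(w, x, c):
    # Second derivative of the same one-dimensional objective.
    s = sigmoid(x * w)
    return np.sum(c * x**2 * s * (1.0 - s)) / np.sum(c)

def minimizer(x, c, lo=-50.0, hi=50.0, iters=200):
    # Bisection on L'(w), which is increasing because L is convex.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid, x, c) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def run_gd(x, c, gamma, w0, steps):
    # GD with the constant step size eta = gamma / L''(w*).
    w_star = minimizer(x, c)
    eta = gamma / hess(w_star, x, c)
    w = w0
    traj = np.empty(steps)
    for t in range(steps):
        w = w - eta * grad(w, x, c)
        traj[t] = w
    return w_star, eta, traj

def detect_period(traj, max_period=64, tol=1e-6):
    # Smallest p for which the last p iterates repeat the p iterates before
    # them (within tol); returns None if no such p <= max_period is found.
    for p in range(1, max_period + 1):
        if np.max(np.abs(traj[-p:] - traj[-2 * p:-p])) < tol:
            return p
    return None

w0 = 5.0  # hypothetical initialization; not given in the caption

# Base dataset: 250 copies of x = 1 and 200 copies of x = -1.
x = np.array([1.0, -1.0])
c = np.array([250.0, 200.0])
w_star, eta, traj = run_gd(x, c, 1.5, w0, 500)
print("base:", w_star, eta, detect_period(traj))

# Augmented dataset: the base data plus 15 copies of x = 70.
x_eps = np.array([1.0, -1.0, 70.0])
c_eps = np.array([250.0, 200.0, 15.0])
w_star_eps, eta_eps, traj_eps = run_gd(x_eps, c_eps, 1.5, w0, 5000)
print("augmented:", w_star_eps, eta_eps, detect_period(traj_eps))
```

On the base dataset the second derivative never exceeds $1/4$, so the step size $1.5/\mathcal{L}''(w^*)$ stays below the classical global threshold of $2$ divided by the smoothness constant, and the run settles at $w^*=\log(1.25)$. What the augmented run does depends on `w0`, so the printed period is an observation for this particular start rather than a reproduction of the figure.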

Theorems & Definitions (44)

  • Corollary 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Corollary 2
  • proof
  • Theorem 3
  • proof
  • Theorem 3
  • proof
  • ...and 34 more