Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes

Si Yi Meng; Antonio Orvieto; Daniel Yiming Cao; Christopher De Sa

TL;DR

Although local convergence is guaranteed for all step sizes less than the critical step size, global convergence is not, and gradient descent may instead converge to a cycle depending on the initialization.

Abstract

We study gradient descent (GD) dynamics on logistic regression problems with large, constant step sizes. For linearly separable data, it is known that GD converges to the minimizer with arbitrarily large step sizes, a property that no longer holds when the problem is not separable. In fact, the behaviour can be much more complex: a sequence of period-doubling bifurcations begins at the critical step size $2/\lambda$, where $\lambda$ is the largest eigenvalue of the Hessian at the solution. Using a smaller-than-critical step size guarantees convergence if GD is initialized near the solution, but does this suffice globally? In one dimension, we show that a step size less than $1/\lambda$ suffices for global convergence. However, for all step sizes between $1/\lambda$ and the critical step size $2/\lambda$, one can construct a dataset such that GD converges to a stable cycle. In higher dimensions, this is actually possible even for step sizes less than $1/\lambda$. Our results show that although local convergence is guaranteed for all step sizes less than the critical step size, global convergence is not, and GD may instead converge to a cycle depending on the initialization.
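As a rough illustration of the step-size parametrization in the abstract, the sketch below runs GD on a one-dimensional logistic regression problem with the step size written as $\eta = \gamma/\lambda$, where $\lambda$ is the second derivative of the loss at the minimizer. The data (250 copies of $x_i = 1$ and 200 copies of $x_i = -1$) follow the base dataset described in the Figure 4 caption. Everything else is an assumption made for this example and not taken from the paper: the labels are absorbed into $x_i$, the loss is averaged over the samples, and the function names and the initialization $w_0 = 5$ are hypothetical. Near the minimizer, the one-dimensional GD map contracts with factor $|1 - \gamma|$, which is why $\gamma < 2$ gives local convergence.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function via tanh.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def grad(w, x):
    # Derivative of the mean logistic loss mean(log(1 + exp(-x * w))).
    # Assumption: labels are absorbed into x, so each x_i carries its sign.
    return np.mean(-x * sigmoid(-x * w))

def hess(w, x):
    # Second derivative of the same mean logistic loss.
    s = sigmoid(x * w)
    return np.mean(x ** 2 * s * (1.0 - s))

def gd(w0, x, eta, steps):
    # Plain gradient descent with a constant step size eta.
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w, x)
    return w

# Non-separable 1D data: 250 copies of x = 1 and 200 copies of x = -1.
x = np.concatenate([np.ones(250), -np.ones(200)])

# For this dataset the minimizer has a closed form: exp(w*) = 250 / 200.
w_star = np.log(250 / 200)
lam = hess(w_star, x)      # curvature at the solution (equals 20/81 here)

gamma = 1.5                # step size as a fraction of 1 / lam (critical value is 2)
eta = gamma / lam
w_final = gd(5.0, x, eta, steps=200)

print(f"w* = {w_star:.6f}, lambda = {lam:.6f}, eta = {eta:.4f}")
print(f"final iterate = {w_final:.6f}")
```

On this particular dataset the second derivative never exceeds $1/4$, so $\eta = 1.5/\lambda \approx 6.1$ is still below the classical $2/L$ threshold and GD converges from any starting point. The cycles described in the abstract require datasets built specifically for that purpose, which this sketch does not attempt.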

Paper Structure (26 sections, 25 theorems, 130 equations, 10 figures, 1 table)


Introduction
Background
Period-doubling bifurcation and chaos
A toy dataset
Technical setup
One-dimensional case
Higher dimensions
Discussion
Proofs in one dimension
Convergence under the stable step size
Cycle construction below the critical step size
Proofs in higher dimensions
Cycle construction in two dimensions
Miscellaneous results
Properties of the logistic loss and the squareplus loss
...and 11 more sections

Key Result

Corollary 1

assmpt:individual-loss implies that

Figures (10)

Figure 1: Bifurcation diagrams on two binary classification datasets from the LIBSVM repository (Chang and Lin, 2011), both of which are non-separable. For each step size, we run GD for $T= 5\cdot 10^5$ iterations with $1024$ different random initializations of varying scales. Each point corresponds to the loss (first row) or the (scaled) largest eigenvalue of the Hessian (second row) evaluated at the final iterate $w_T$. When multiple points are visible, GD either converged to a cycle or is chaotic under that step size.
Figure 2: On the left is the bifurcation diagram on a synthetic dataset with $n=12$ and $d=4$. The $x_i$'s are generated from the standard Gaussian distribution, with uniformly random labels. In the middle, we plot the loss at each GD iteration when run with $\eta=35$, for two different initializations $w_0=100\cdot \mathbf{1}$ and $w_0=0.001\cdot \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector. On the right, we compute the power spectral density of the losses over $t$, which shows a period-$2$ cycle for the large initialization, while the run with the small initialization converged to a period-$8$ cycle.
Figure 3: The limit in \ref{eq:limit-is-relu} using the logistic loss as an example. Darker grey represents smaller values of $\epsilon$.
Figure 4: Illustration of our cycle construction. On the left, ${\mathcal{L}}$ is the logistic loss with data consisting of $250$ copies of $x_i=1$ and $200$ copies of $x_i=-1$. On top of this dataset we add $15$ copies of $x_i=70$ to get ${\mathcal{L}}_\epsilon$ on the right. The red stars mark the minimizers $w^*$ and $w^*_\epsilon$ of the respective objectives. Starting at the same $w_0$ and using the same $\gamma=1.5$, GD on ${\mathcal{L}}$ with $\eta = \gamma/{\mathcal{L}}''(w^*)$ converges to the minimizer, while GD on ${\mathcal{L}}_\epsilon$ with $\eta_\epsilon = \gamma/{\mathcal{L}}_\epsilon''(w^*_\epsilon)$ converges to a period-$7$ cycle.
Figure 5: For each $\gamma$, we construct a one-dimensional dataset such that GD on this problem with the logistic loss converges to a stable cycle under the step size $\eta=\gamma/{\mathcal{L}}''(w^*)$. One exception is the last column, which seems to suggest that GD can even be chaotic when $\gamma<2$. Figures in the first row show the loss evaluated at successive iterates, while the second row shows the power spectral density of the losses over the last $1024$ iterations. Details on each dataset are in \ref{sec:exps}.
...and 5 more figures
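
Several captions above read off the period of a cycle from the power spectral density of the loss sequence. The sketch below shows one simple way such a readout could work, using a plain FFT periodogram on synthetic loss sequences. It is not the authors' procedure: the function name `dominant_period`, the synthetic sequences, and the choice of estimator are all assumptions made for illustration. For a real cycle, harmonics can carry more power than the fundamental frequency, so the strongest peak should be read alongside the whole spectrum.

```python
import numpy as np

def dominant_period(losses):
    """Return the period (in iterations) of the strongest non-constant
    frequency component, using a plain FFT periodogram."""
    losses = np.asarray(losses, dtype=float)
    centered = losses - losses.mean()            # remove the constant (DC) part
    power = np.abs(np.fft.rfft(centered)) ** 2   # periodogram (unnormalized)
    freqs = np.fft.rfftfreq(len(centered), d=1.0)
    k = int(np.argmax(power[1:])) + 1            # skip the zero-frequency bin
    return 1.0 / freqs[k]

# Synthetic stand-ins for the last 1024 loss values of a GD run (hypothetical data).
t = np.arange(1024)
period2_losses = 0.60 + 0.05 * (-1.0) ** t                  # alternates between two values
period8_losses = 0.60 + 0.05 * np.cos(2 * np.pi * t / 8)    # repeats every 8 iterations

print(dominant_period(period2_losses))
print(dominant_period(period8_losses))
```

A window length that is a multiple of the cycle period, such as $1024$ for periods $2$ and $8$, places the peak in a single frequency bin and avoids spectral leakage.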

Theorems & Definitions (44)

Corollary 1
Theorem 1
Theorem 2
Theorem 3
Corollary 2
proof
Theorem 3
proof
Theorem 3
proof
...and 34 more
