Risk and parameter convergence of logistic regression

Ziwei Ji, Matus Telgarsky

TL;DR

This work analyzes gradient descent on the empirical logistic (and exponential) risk for general data, showing that the iterates are biased toward a unique ray: its direction is the maximum-margin predictor on a linearly separable part of the data, and its offset is the bounded risk minimizer on the remaining data. By decomposing the problem along a subspace S and its orthogonal complement S^⊥ and identifying the corresponding minimizers v̄ and ū, the authors prove both risk convergence to the global infimum and parameter convergence toward the ray, with explicit rates: direction convergence at O(ln ln t / ln t) and offset convergence at O((ln t)^2 / sqrt(t)) under standard step-size choices, with refinements for the separable and non-separable regimes. The analysis relies on a Fenchel-Young framework and refined smoothness arguments, connects to AdaBoost margins and perceptron-style bounds, and gives a principled account of the implicit bias and implicit regularization of gradient descent for logistic regression. The results cover the case where the risk has no finite minimizer and describe the asymptotic behavior of gradient descent on a problem that is not strongly convex.

Abstract

Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}(\ln\ln t / \ln t)$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}((\ln t)^2 / \sqrt{t})$.
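The directional bias described in the abstract can be observed numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' code: the three data points, the step size of 1, the step count, and the helper names `logistic_risk_grad` and `run_gd` are all hypothetical choices. The data are linearly separable, so only the direction part of the ray is exercised; the bounded offset on non-separable data is not covered. For this data set the maximum-margin direction is (1, 0), with the first two points as support vectors.

```python
import numpy as np

def logistic_risk_grad(w, z):
    """Gradient of R(w) = mean(ln(1 + exp(-<w, z_i>))), with z_i = x_i * y_i."""
    margins = z @ w
    # sigma(-m) = 1 / (1 + exp(m)) weights each example's contribution.
    weights = 1.0 / (1.0 + np.exp(margins))
    return -(weights[:, None] * z).mean(axis=0)

def run_gd(z, steps, step_size=1.0):
    """Plain gradient descent from w_0 = 0; returns the iterates w_1..w_steps."""
    w = np.zeros(z.shape[1])
    history = []
    for _ in range(steps):
        w = w - step_size * logistic_risk_grad(w, z)
        history.append(w.copy())
    return np.array(history)

# Hypothetical separable data, labels already folded in (rows are x_i * y_i).
# Every row has norm at most 1, matching the normalization |x_i y_i| <= 1.
z = np.array([[0.5, 0.5], [0.5, -0.5], [0.9, 0.3]])
max_margin_dir = np.array([1.0, 0.0])  # maximizes min_i <u, z_i> over unit u

history = run_gd(z, steps=5000)
norms = np.linalg.norm(history, axis=1)
alignment = (history @ max_margin_dir) / norms  # cosine with max-margin direction

for t in (1, 10, 100, 1000, 5000):
    print(f"t={t:5d}  |w_t|={norms[t - 1]:7.3f}  cosine={alignment[t - 1]:.6f}")
```

In this run the norm of the iterate keeps growing while the cosine with the maximum-margin direction approaches 1, which is the qualitative behavior the abstract describes. The example is too small to say anything about the stated rates.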

Paper Structure

This paper contains 23 sections, 19 theorems, 93 equations, 4 figures.

Key Result

theorem 1

Let examples $((x_i,y_i))_{i=1}^n$ be given satisfying $|x_iy_i|\leq 1$, along with a loss $\ell \in\{\ell_{\log}, \exp\}$, with corresponding risk $\mathcal{R}$ as above. Consider gradient descent iterates $(w_j)_{j\geq 0}$ as above, with $w_0 = 0$.
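The statement refers to a risk and to iterates defined "as above", but this excerpt does not reproduce those definitions. A standard formulation, given here as an assumption rather than a quotation from the paper (the sign convention and the step-size symbol $\eta_j$ in particular are assumed), would read:

```latex
\ell_{\log}(z) = \ln\bigl(1 + e^{z}\bigr), \qquad \exp(z) = e^{z},
\qquad
\mathcal{R}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(-y_i \langle w, x_i \rangle\bigr),
\qquad
w_{j+1} = w_j - \eta_j \nabla \mathcal{R}(w_j).
```

Under this convention a correctly classified example has $y_i \langle w, x_i \rangle > 0$ and contributes a loss that decays to zero as the margin grows, so on separable data the risk has infimum zero but no finite minimizer.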

Figures (4)

  • Figure 1: Separable.
  • Figure 2: Strongly convex.
  • Figure 3: Mixed data.
  • Figure 4: The general case.

Theorems & Definitions (38)

  • theorem 1: simplification of the results labeled fact:struct, thm:risk_converge, and fact:min_norm
  • theorem 2
  • theorem 3
  • lemma 1
  • lemma 2
  • proof
  • lemma 3
  • theorem 4
  • lemma 4
  • proof
  • ...and 28 more