Convergence of stochastic gradient descent under a local Lojasiewicz condition for deep neural networks

Jing An, Jianfeng Lu

TL;DR

This work studies SGD convergence for non-convex losses in finitely wide neural networks by extending Chatterjee's local Łojasiewicz framework to the stochastic setting. It assumes a local Łojasiewicz condition, a local structural assumption on level sets, and a noise model in which the gradient noise scales with the objective, and from these it derives step-size conditions and probabilistic convergence guarantees. The main contribution is a high-probability result: when SGD is initialized in a region where the local Łojasiewicz condition holds and the initial loss is sufficiently small, the iterates remain in a local ball and converge to a zero-loss minimum at a quantified contraction rate, with almost-sure convergence on the event E∞(R−1). The analysis also clarifies a limitation by showing that convergence can fail under bounded noise. Finally, the paper shows that certain finitely wide neural networks satisfy the required assumptions, which makes the results relevant to SGD behavior beyond idealized infinite-width regimes.

Abstract

We study the convergence of stochastic gradient descent (SGD) for non-convex objective functions. We establish local convergence with positive probability under the local Łojasiewicz condition introduced by Chatterjee (2022) and an additional local structural assumption on the loss landscape. A key component of our proof is to ensure that the entire SGD trajectory stays inside the local region with positive probability. We also provide examples of neural networks with finite widths for which our assumptions hold.
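
The contrast drawn in the TL;DR between objective-scaled noise and bounded noise can be illustrated with a small simulation. The sketch below is not taken from the paper: the quadratic toy loss, the Gaussian noise, the step size, and the helper names (`loss`, `grad`, `run_sgd`) are all assumptions chosen for illustration. The toy loss satisfies a Łojasiewicz-type inequality globally, so it only hints at the mechanism and not at the local analysis.

```python
import math
import random


def loss(theta):
    # Toy loss F(theta) = 0.5 * |theta|^2 (an assumption, not the paper's example).
    # It satisfies |grad F|^2 = 2 * F, a Lojasiewicz-type inequality.
    return 0.5 * sum(t * t for t in theta)


def grad(theta):
    # Gradient of the toy loss.
    return list(theta)


def run_sgd(noise, steps=200, eta=0.1, sigma=1.0, seed=0):
    """Run SGD with additive Gaussian gradient noise.

    noise="scaled":  noise standard deviation is sigma * sqrt(F(theta)),
                     so it vanishes as the loss goes to zero.
    noise="bounded": noise standard deviation is the constant sigma.
    Returns the list of loss values along the trajectory.
    """
    rng = random.Random(seed)
    theta = [1.0, 1.0]
    history = [loss(theta)]
    for _ in range(steps):
        f = loss(theta)
        scale = sigma * math.sqrt(f) if noise == "scaled" else sigma
        g = [gi + scale * rng.gauss(0.0, 1.0) for gi in grad(theta)]
        theta = [t - eta * gi for t, gi in zip(theta, g)]
        history.append(loss(theta))
    return history


scaled = run_sgd("scaled")
bounded = run_sgd("bounded")
print("final loss, objective-scaled noise:", scaled[-1])
print("mean of last 50 losses, bounded noise:", sum(bounded[-50:]) / 50)
```

In this toy setting, the loss under objective-scaled noise should keep contracting toward zero, while under constant-variance noise it is expected to hover at a noise floor set by the step size and `sigma`.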

Paper Structure

This paper contains 6 sections, 6 theorems, and 68 equations.

Key Result

Lemma 2.1

Suppose that $F:\mathbb R^d\to \mathbb R$ is a non-negative function. Assume $\nabla F$ is Lipschitz continuous in a compact set $\mathcal{K}$ with constant $C_L$, and there exists $\bar{C}>0$ such that $\max_{\theta\in\mathcal{K}}|\nabla F(\theta)| = \bar{C}$. Then there exists a compact set …

Theorems & Definitions (13)

  • Lemma 2.1
  • Theorem 3.1
  • proof
  • Lemma 4.1 (Chung 1954, Lemma 1 and Lemma 4)
  • Lemma 4.2
  • proof
  • Remark 4.3
  • Theorem 4.4
  • proof
  • Remark 1.1
  • ...and 3 more