From Sublinear to Linear: Fast Convergence in Deep Networks via Locally Polyak-Lojasiewicz Regions

Agnideep Aich, Ashit Baran Aich, Bruce Wade

TL;DR

Problem: deep networks with non-convex losses train quickly in practice, and global guarantees fail to explain why. Approach: derive Locally Polyak-Lojasiewicz Regions (LPLRs) from Locally Quasi-Convex Regions (LQCRs) under a local NTK stability assumption, establishing the local PL-type bound $\tfrac{1}{2}\|\nabla \mathcal{L}(\theta)\|^2 \ge \mu(\mathcal{L}(\theta)-\mathcal{L}_{\mathcal{R}}^*)$ with $\mu=\lambda_{\min}$, and prove linear convergence of gradient descent within the region. Contributions: (i) a local PL guarantee for finite-width networks; (ii) reliance on local NTK stability rather than on global or infinite-width limits; (iii) empirical validation on MNIST and CIFAR-10 showing PL-like scaling and linear-rate decay in both controlled and realistic settings. Significance: provides an architecture-agnostic explanation for rapid optimization in finite-width deep nets and informs initialization, width, and learning-rate choices by linking local geometry to optimization speed.

Abstract

Deep neural network loss landscapes are non-convex, yet gradient descent (GD) on them often converges far faster in practice than classical guarantees suggest. Prior work shows that within locally quasi-convex regions (LQCRs), GD converges to stationary points at sublinear rates, leaving the commonly observed near-exponential training dynamics unexplained. We show that, under a mild local Neural Tangent Kernel (NTK) stability assumption, the loss satisfies a PL-type error bound within these regions, yielding a Locally Polyak-Lojasiewicz Region (LPLR) in which the squared gradient norm controls the suboptimality gap. For properly initialized finite-width networks, we show that under local NTK stability this PL-type mechanism holds around initialization and establish linear convergence of GD as long as the iterates remain within the resulting LPLR. Empirically, we observe PL-like scaling and linear-rate loss decay in controlled full-batch training and in a ResNet-style CNN trained with mini-batch SGD on a CIFAR-10 subset, indicating that LPLR signatures can persist under modern architectures and stochastic optimization. Overall, the results connect local geometric structure, local NTK stability, and fast optimization rates in a finite-width setting.

Paper Structure

This paper contains 15 sections, 2 theorems, 13 equations, 5 figures, 1 table.

Key Result

Theorem 5.1

Under the same setup as prior work on LQCRs [Aich2025], and assuming the local NTK stability assumption holds, the LQCR $\mathcal{R}$ is also a Locally Polyak-Lojasiewicz Region (LPLR). Specifically, the loss $\mathcal{L}$ satisfies the PL condition $\tfrac{1}{2}\|\nabla \mathcal{L}(\theta)\|^2 \ge \mu(\mathcal{L}(\theta)-\mathcal{L}_{\mathcal{R}}^*)$ for all $\theta \in \mathcal{R}$, with PL constant $\mu = \lambda_{\min}$, where $\lambda_{\min}$ is the lower bound on the smallest eigenvalue of the NTK from that assumption.
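
One standard way to see why a lower bound on the NTK spectrum yields a PL-type inequality is the squared-loss calculation below. It is a sketch under assumptions that this summary does not state: a loss of the form $\mathcal{L}(\theta)=\tfrac{1}{2}\|f(\theta)-y\|^2$ over the training outputs $f(\theta)$, the Jacobian $J(\theta)=\partial f(\theta)/\partial\theta$, the empirical NTK $K(\theta)=J(\theta)J(\theta)^\top$, and $\lambda_{\min}(K(\theta)) \ge \lambda_{\min} > 0$ for all $\theta \in \mathcal{R}$. The paper's own proof, loss, and normalization may differ.

```latex
\begin{align*}
\nabla \mathcal{L}(\theta) &= J(\theta)^\top r(\theta),
  \qquad r(\theta) = f(\theta) - y, \\
\tfrac{1}{2}\|\nabla \mathcal{L}(\theta)\|^2
  &= \tfrac{1}{2}\, r(\theta)^\top J(\theta) J(\theta)^\top r(\theta)
   = \tfrac{1}{2}\, r(\theta)^\top K(\theta)\, r(\theta) \\
  &\ge \tfrac{1}{2}\, \lambda_{\min}\, \|r(\theta)\|^2
   = \lambda_{\min}\, \mathcal{L}(\theta)
   \ge \lambda_{\min} \bigl( \mathcal{L}(\theta) - \mathcal{L}_{\mathcal{R}}^* \bigr).
\end{align*}
```

The last step uses $\mathcal{L}_{\mathcal{R}}^* \ge 0$. Under these assumptions the chain is a PL inequality with constant $\lambda_{\min}$; a different loss normalization (for example a $1/n$ factor) would rescale the constant accordingly.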

Figures (5)

  • Figure 1: Training loss dynamics for the MLP on MNIST. The y-axis (suboptimality gap) is on a logarithmic scale. After an initial transient, the gap decays rapidly over a long training regime, consistent with the linear-rate behavior predicted by Theorem 5.4.
  • Figure 2: Comparison of convergence dynamics for standard and enhanced initializations for the MLP on MNIST (semi-log scale). The enhanced initialization reduces the severity of the early transient, while both methods display very similar decay behavior across most of training.
  • Figure 3: Empirical relationship between squared gradient norm and suboptimality gap for the MLP on MNIST (log-log scale). The strong polynomial coupling (with fitted slope $\approx 1.14$ over the plotted range) is consistent with PL/KL-type error-bound behavior known to yield fast rates [Necoara2015, Richtarik2014].
  • Figure 4: Validation on a ResNet-style CNN trained on a 5-class CIFAR-10 subset with SGD (smoothed curves). (Left) The smoothed suboptimality gap decays approximately linearly on a semi-log scale over an extended regime, indicating linear-rate behavior in the explored region. (Right) The log-log plot of squared gradient norm versus suboptimality gap exhibits an approximately power-law trend; the fitted slope is $\approx 0.29$, providing evidence of a PL/KL-like error-bound relationship in this complex, stochastic setting.
  • Figure 5: Impact of network width on the final training loss achieved after 250 epochs. The trend is monotone: larger widths consistently yield lower final loss.
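
The slopes quoted in the captions above come from straight-line fits on log-log and semi-log axes. A minimal sketch of how such fits can be computed is shown below. The function names, the choice to regress the squared gradient norm on the gap, and the synthetic data are illustrative assumptions, not the authors' code or measurements.

```python
import numpy as np


def fit_loglog_slope(gaps, grad_sq):
    """Least-squares slope of log(grad_sq) against log(gaps).

    A slope near 1 means the squared gradient norm scales linearly
    with the suboptimality gap over the fitted range.
    """
    gaps = np.asarray(gaps, dtype=float)
    grad_sq = np.asarray(grad_sq, dtype=float)
    slope, _intercept = np.polyfit(np.log(gaps), np.log(grad_sq), 1)
    return float(slope)


def fit_linear_rate(gaps):
    """Per-step contraction factor rho from a semi-log fit, gap_t ~ C * rho**t."""
    gaps = np.asarray(gaps, dtype=float)
    steps = np.arange(len(gaps))
    slope, _intercept = np.polyfit(steps, np.log(gaps), 1)
    return float(np.exp(slope))


if __name__ == "__main__":
    # Synthetic stand-in data (hypothetical, NOT the paper's measurements):
    # a geometrically decaying gap and a power-law gradient relationship.
    t = np.arange(200)
    gaps = 1.0 * 0.97 ** t
    grad_sq = 2.0 * gaps ** 1.14
    print("log-log slope:", fit_loglog_slope(gaps, grad_sq))
    print("contraction factor:", fit_linear_rate(gaps))
```

On real training logs both inputs must be strictly positive, and the fit should be restricted to the regime after the initial transient, since the early steps can dominate a least-squares line.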

Theorems & Definitions (8)

  • Definition 4.1: Locally Polyak-Lojasiewicz (LPL) Region
  • Definition 4.2: Locally Quasi-Convex Region [Aich2025]
  • Remark 4.3: On the descent condition orientation
  • Theorem 5.1: Existence of LPLRs
  • Proof: Proof of Theorem 5.1
  • Remark 5.3
  • Theorem 5.4: Linear Convergence of Gradient Descent
  • Proof: Proof of Theorem 5.4
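
The mechanism behind a linear-convergence result of this kind can be illustrated with the standard argument for a $\beta$-smooth loss satisfying $\tfrac{1}{2}\|\nabla \mathcal{L}(\theta)\|^2 \ge \mu(\mathcal{L}(\theta)-\mathcal{L}^*)$: a gradient step with $\eta = 1/\beta$ gives $\mathcal{L}(\theta_{t+1}) - \mathcal{L}^* \le (1 - \mu/\beta)(\mathcal{L}(\theta_t) - \mathcal{L}^*)$. The sketch below checks this contraction numerically on a one-dimensional non-convex toy function, $f(x) = x^2 + 3\sin^2 x$, which is an assumption chosen for illustration and does not appear in the paper. The PL constant is estimated on a grid rather than taken from the theorem, whose exact step-size condition and rate constant are not given in this summary.

```python
import numpy as np

# Smoothness constant of the toy loss: |f''(x)| = |2 + 6 cos(2x)| <= 8.
BETA = 8.0


def loss(x):
    """Non-convex toy loss with minimum value 0 at x = 0 (illustrative only)."""
    return x ** 2 + 3.0 * np.sin(x) ** 2


def grad(x):
    """Derivative of the toy loss: 2x + 6 sin(x) cos(x) = 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * np.sin(2.0 * x)


def estimate_pl_constant(radius, num=200001):
    """Grid estimate of min over 0 < |x| <= radius of 0.5 * f'(x)^2 / f(x)."""
    xs = np.linspace(-radius, radius, num)
    xs = xs[np.abs(xs) > 1e-6]  # exclude the minimizer, where the ratio is 0/0
    return float(np.min(0.5 * grad(xs) ** 2 / loss(xs)))


def run_gd(x0, steps, eta=1.0 / BETA):
    """Plain gradient descent; returns the suboptimality gap at every iterate."""
    x = float(x0)
    gaps = [float(loss(x))]  # the optimal value is 0, so gap == loss
    for _ in range(steps):
        x -= eta * grad(x)
        gaps.append(float(loss(x)))
    return gaps


if __name__ == "__main__":
    mu = estimate_pl_constant(3.0)
    gaps = run_gd(3.0, 20)
    print("estimated PL constant on [-3, 3]:", mu)
    print("predicted contraction factor 1 - mu/beta:", 1.0 - mu / BETA)
    print("observed ratios:", [g1 / g0 for g0, g1 in zip(gaps, gaps[1:]) if g0 > 0])
```

The observed per-step ratios can be much smaller than $1 - \mu/\beta$, since that factor is a worst-case bound over the whole region.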