From Sublinear to Linear: Fast Convergence in Deep Networks via Locally Polyak-Lojasiewicz Regions

Agnideep Aich; Ashit Baran Aich; Bruce Wade

TL;DR

Problem: non-convex deep-network losses train fast in practice, and global guarantees fail to explain why. Approach: derive Locally Polyak-Lojasiewicz Regions (LPLRs) from Locally Quasi-Convex Regions (LQCRs) under a local NTK stability assumption, establishing the local PL-type bound $\tfrac{1}{2}\|\nabla \mathcal{L}(\theta)\|^2 \ge \mu(\mathcal{L}(\theta)-\mathcal{L}_{\mathcal{R}}^*)$ with $\mu=\lambda_{\min}$, and prove linear convergence of gradient descent within the region. Contributions: (i) a local PL guarantee for finite-width networks; (ii) reliance on local NTK stability rather than global or infinite-width limits; (iii) empirical validation on MNIST and CIFAR-10 showing PL-like scaling and linear-rate decay across controlled and realistic settings. Significance: provides an architecture-agnostic explanation for rapid optimization in finite-width deep nets and informs initialization, width, and learning-rate choices by linking local geometry to optimization speed.
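
To see why a PL-type bound of this form yields a linear rate, the standard argument can be sketched as follows. It rests on two assumptions that are not stated in this summary and are introduced here only for illustration: $\mathcal{L}$ is $\beta$-smooth on the region, and GD uses the step size $\eta = 1/\beta$ with iterates $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ that stay inside the region. The first inequality is the descent lemma; the second applies the PL-type bound.

$$
\begin{aligned}
\mathcal{L}(\theta_{t+1}) &\le \mathcal{L}(\theta_t) - \tfrac{1}{2\beta}\|\nabla \mathcal{L}(\theta_t)\|^2 \\
&\le \mathcal{L}(\theta_t) - \tfrac{\mu}{\beta}\bigl(\mathcal{L}(\theta_t)-\mathcal{L}_{\mathcal{R}}^*\bigr)
\end{aligned}
$$

Subtracting $\mathcal{L}_{\mathcal{R}}^*$ from both sides gives $\mathcal{L}(\theta_{t+1})-\mathcal{L}_{\mathcal{R}}^* \le (1-\mu/\beta)\,(\mathcal{L}(\theta_t)-\mathcal{L}_{\mathcal{R}}^*)$, so the suboptimality gap contracts geometrically: $\mathcal{L}(\theta_t)-\mathcal{L}_{\mathcal{R}}^* \le (1-\mu/\beta)^t\,(\mathcal{L}(\theta_0)-\mathcal{L}_{\mathcal{R}}^*)$.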

Abstract

The loss landscapes of deep neural networks are non-convex, yet gradient descent (GD) often converges far faster in practice than classical guarantees suggest. Prior work shows that within locally quasi-convex regions (LQCRs), GD converges to stationary points at sublinear rates, leaving the commonly observed near-exponential training dynamics unexplained. We show that, under a mild local Neural Tangent Kernel (NTK) stability assumption, the loss satisfies a PL-type error bound within these regions, yielding a Locally Polyak-Lojasiewicz Region (LPLR) in which the squared gradient norm controls the suboptimality gap. For properly initialized finite-width networks, this PL-type mechanism holds around initialization under local NTK stability, and GD converges linearly as long as the iterates remain within the resulting LPLR. Empirically, we observe PL-like scaling and linear-rate loss decay in controlled full-batch training and in a ResNet-style CNN trained with mini-batch SGD on a CIFAR-10 subset, indicating that LPLR signatures can persist under modern architectures and stochastic optimization. Overall, the results connect local geometric structure, local NTK stability, and fast optimization rates in a finite-width setting.
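
The two signatures mentioned above (PL-like scaling of the squared gradient norm against the loss, and linear-rate loss decay) can be illustrated on a toy problem. The following is a minimal sketch and not the paper's experimental setup: it assumes an overparameterized least-squares loss $f(x)=\tfrac{1}{2}\|Ax-b\|^2$ as a stand-in for a network, where the Gram matrix $AA^\top$ plays a role loosely analogous to a kernel matrix and its smallest eigenvalue serves as the PL constant. The function name `run_gd_least_squares`, the problem sizes, and the step size $1/\beta$ are all hypothetical choices made for this example.

```python
import numpy as np


def run_gd_least_squares(n=20, d=50, steps=200, seed=0):
    """Run GD on f(x) = 0.5 * ||A x - b||^2 with n < d (toy stand-in, not a network).

    Returns the PL constant mu, the smoothness constant beta, the loss at each
    step, and the ratio 0.5 * ||grad||^2 / loss at each step.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)

    # Eigenvalues of the n x n Gram matrix A A^T, in ascending order.
    eigs = np.linalg.eigvalsh(A @ A.T)
    mu, beta = eigs[0], eigs[-1]  # PL constant and smoothness constant

    x = np.zeros(d)
    losses, pl_ratios = [], []
    for _ in range(steps):
        residual = A @ x - b
        grad = A.T @ residual
        loss = 0.5 * residual @ residual  # optimal value is 0 since n < d
        losses.append(loss)
        pl_ratios.append(0.5 * (grad @ grad) / loss)
        x = x - grad / beta  # GD step with step size 1 / beta
    return mu, beta, np.array(losses), np.array(pl_ratios)


if __name__ == "__main__":
    mu, beta, losses, pl_ratios = run_gd_least_squares()
    print("mu / beta:", mu / beta)
    print("min PL ratio / mu:", pl_ratios.min() / mu)
    print("worst per-step contraction:", (losses[1:] / losses[:-1]).max())
```

In this sketch the PL ratio never falls below `mu`, and each step shrinks the loss by at least the factor `1 - mu / beta`, so the loss appears as a straight line on a log scale. For a real network the analogous check would track the same two quantities along the training trajectory, with the caveat that the bound is only expected to hold while the iterates remain inside the region.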
