Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

Chaoyue Liu, Libin Zhu, Mikhail Belkin

TL;DR

The paper develops a unifying PL$^*$ framework for loss landscapes in over-parameterized nonlinear systems, including deep neural networks, arguing that PL$^*$ holds on most of parameter space and guarantees solution existence and fast gradient-based optimization despite inherent non-convexity. It links PL$^*$ to the spectrum of the tangent kernel (NTK) and shows that wide neural networks satisfy PL$^*$ because their NTK remains well-conditioned in a neighborhood of initialization, aided by a small Hessian norm. The authors present general techniques to establish PL$^*$ (uniform conditioning via Hessian norms, and preservation of conditioning under transformations and model compositions) and extend the results to an $\epsilon$-relaxed condition, PL$^*_\epsilon$, suitable for almost over-parameterized regimes. They further discuss implications for interpolation thresholds, transitions along optimization paths, and extensions to CNNs, ResNets, and manifold settings, highlighting the central role of condition numbers in guiding optimization methods beyond standard convexity assumptions.

Abstract

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization problems corresponding to such systems are generally not convex, even locally. We argue that instead they satisfy PL$^*$, a variant of the Polyak-Lojasiewicz condition, on most (but not all) of the parameter space, which guarantees both the existence of solutions and efficient optimization by (stochastic) gradient descent (SGD/GD). The PL$^*$ condition of these systems is closely related to the condition number of the tangent kernel associated with a non-linear system, showing how a PL$^*$-based non-linear theory parallels classical analyses of over-parameterized linear equations. We show that wide neural networks satisfy the PL$^*$ condition, which explains the (S)GD convergence to a global minimum. Finally, we propose a relaxation of the PL$^*$ condition applicable to "almost" over-parameterized systems.
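To make the link between PL$^*$, the tangent kernel, and gradient descent concrete, here is a minimal sketch on the simplest over-parameterized case: a linear system $\mathcal{F}(\mathbf{w}) = A\mathbf{w}$ with 2 equations and 3 unknowns. It assumes the square loss $\mathcal{L}(\mathbf{w}) = \frac{1}{2}\|\mathcal{F}(\mathbf{w}) - \mathbf{y}\|^2$ and the $\mu$-PL$^*$ inequality $\frac{1}{2}\|\nabla\mathcal{L}(\mathbf{w})\|^2 \ge \mu\,\mathcal{L}(\mathbf{w})$. The matrix `A`, target `y`, starting point `w0`, and all function names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: a linear over-parameterized system F(w) = A w
# with 2 equations and 3 unknowns. A, y, and w0 are made-up example values.

A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]


def residual(w):
    """Return r = F(w) - y for the linear map F(w) = A w."""
    return [sum(a * wi for a, wi in zip(row, w)) - yi for row, yi in zip(A, y)]


def loss(w):
    """Square loss L(w) = 0.5 * ||F(w) - y||^2."""
    return 0.5 * sum(r * r for r in residual(w))


def grad(w):
    """Gradient of the square loss: A^T (A w - y)."""
    r = residual(w)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(w))]


# For this A, the tangent kernel K = A A^T = [[2, 1], [1, 2]]
# has eigenvalues 1 and 3 (computed by hand).
MU = 1.0    # lambda_min(K): the PL* constant in this linear example
BETA = 3.0  # lambda_max(K): the smoothness constant of the loss


def pl_star_holds(w, mu=MU):
    """Check 0.5 * ||grad L(w)||^2 >= mu * L(w), up to rounding error."""
    g = grad(w)
    return 0.5 * sum(gi * gi for gi in g) >= mu * loss(w) - 1e-12


def gradient_descent(w, steps=30, lr=1.0 / BETA):
    """Plain gradient descent; returns the final point and the loss history."""
    losses = [loss(w)]
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        losses.append(loss(w))
    return w, losses


w0 = [0.5, -1.0, 2.0]
w_final, losses = gradient_descent(w0)
```

With step size $1/\beta$, the standard one-step bound for a $\beta$-smooth loss satisfying $\mu$-PL$^*$ is $\mathcal{L}(\mathbf{w}_{t+1}) \le (1 - \mu/\beta)\,\mathcal{L}(\mathbf{w}_t)$, so the entries of `losses` should shrink at least geometrically even though the system has infinitely many solutions and the minimizer is not unique. The non-linear case treated in the paper replaces the constant kernel $AA^T$ with a kernel $K(\mathbf{w})$ that varies with the parameters.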

Paper Structure

This paper contains 39 sections, 22 theorems, 93 equations, 5 figures.

Key Result

Proposition 1

If map ${\mathcal{F}}$ is $L_{\mathcal{F}}$-Lipschitz, then $\|K({\mathbf{w}})\|_2 \le L_{\mathcal{F}}^2$.
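
A short derivation shows why this bound holds. It assumes ${\mathcal{F}}$ is differentiable with Jacobian $D{\mathcal{F}}({\mathbf{w}})$ and that the tangent kernel is defined as $K({\mathbf{w}}) = D{\mathcal{F}}({\mathbf{w}})\,D{\mathcal{F}}({\mathbf{w}})^T$. The argument is a sketch and not necessarily the paper's own proof.

$$
\|K({\mathbf{w}})\|_2 \;=\; \big\|D{\mathcal{F}}({\mathbf{w}})\,D{\mathcal{F}}({\mathbf{w}})^T\big\|_2 \;=\; \|D{\mathcal{F}}({\mathbf{w}})\|_2^2 \;\le\; L_{\mathcal{F}}^2 .
$$

The middle equality is the identity $\|MM^T\|_2 = \|M\|_2^2$, which holds for any matrix $M$. The final inequality uses the fact that an $L_{\mathcal{F}}$-Lipschitz differentiable map has Jacobian spectral norm at most $L_{\mathcal{F}}$ at every point.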

Figures (5)

  • Figure 1: Panel (a): Loss landscape is locally convex at local minima. Panel (b): Loss landscape incompatible with local convexity as the set of global minima is not locally linear.
  • Figure 2: The loss function ${\mathcal{L}}({\mathbf{w}})$ is $\mu$-PL$^*$ inside the shaded domain. Singular set corresponds to parameters ${\mathbf{w}}$ with degenerate tangent kernel $K({\mathbf{w}})$. Every ball of radius $R=O(1/\mu)$ within the shaded set intersects with the set of global minima of ${\mathcal{L}}({\mathbf{w}})$, i.e., solutions to ${\mathcal{F}}({\mathbf{w}})={\mathbf{y}}$.
  • Figure 3: The loss landscape of "almost over-parameterized" systems. The landscape looks over-parameterized except for the grey area where the loss is small. Local minima of the loss are contained there.
  • Figure 4: $\mu$-PL$^*$ domain and the singular set. We expect points away from the singular set to satisfy $\mu$-PL$^*$ condition for sufficiently small $\mu$.
  • Figure 5: The loss landscape of under-parameterized systems. In the set $\mathcal{S}_\epsilon$, where the loss is larger than $\epsilon$, PL$^*$ still holds. Beyond that, the loss landscape can be arbitrary, which is the grey area, and PL$^*$ doesn't hold any more.

Theorems & Definitions (51)

  • Definition 1: Lipschitz continuity
  • Remark 1
  • Proposition 1
  • Definition 2: Smoothness
  • Proposition 2: Local non-convexity
  • Remark 2
  • Definition 3: Uniform conditioning
  • Theorem 1: Uniform conditioning $\Rightarrow$ PL$^*$ condition
  • proof
  • Proposition 3
  • ...and 41 more