Loss landscapes and optimization in over-parameterized non-linear systems and neural networks
Chaoyue Liu, Libin Zhu, Mikhail Belkin
TL;DR
The paper develops a unifying PL$^*$ framework for loss landscapes in over-parameterized nonlinear systems, including deep neural networks, arguing that PL$^*$ holds on most of parameter space and guarantees solution existence and fast gradient-based optimization despite inherent non-convexity. It links PL$^*$ to the spectrum of the tangent kernel (NTK) and shows that wide neural networks satisfy PL$^*$ because their NTK remains well-conditioned in a neighborhood of initialization, aided by a small Hessian norm. The authors present general techniques to establish PL$^*$ (uniform conditioning via Hessian norms, and preservation of conditioning under transformations and model compositions) and extend the results to an ε-relaxed PL$^*_\epsilon$ condition suitable for almost over-parameterized regimes. They further discuss implications for interpolation thresholds, transitions along optimization paths, and extensions to CNNs, ResNets, and manifold settings, highlighting the central role of condition numbers in guiding optimization methods beyond standard convexity assumptions.
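For orientation, the link between PL$^*$ and the tangent kernel can be written out for the square loss. The symbols below ($F$, $y$, $w$, $\mu$, $S$, $K$) are notation assumed for this sketch and are not fixed by the summary above. A loss $\mathcal{L}$ is called $\mu$-PL$^*$ on a set $S$ if

$$\tfrac{1}{2}\,\|\nabla \mathcal{L}(w)\|^2 \;\ge\; \mu\,\mathcal{L}(w) \quad \text{for all } w \in S.$$

For $\mathcal{L}(w) = \tfrac{1}{2}\|F(w) - y\|^2$ with tangent kernel $K(w) = DF(w)\,DF(w)^T$, the gradient is $\nabla \mathcal{L}(w) = DF(w)^T (F(w) - y)$, so

$$\tfrac{1}{2}\,\|\nabla \mathcal{L}(w)\|^2 \;=\; \tfrac{1}{2}\,(F(w) - y)^T K(w)\,(F(w) - y) \;\ge\; \lambda_{\min}(K(w))\,\mathcal{L}(w).$$

Under these assumptions, a uniform lower bound $\lambda_{\min}(K(w)) \ge \mu > 0$ on $S$ is enough for $\mathcal{L}$ to be $\mu$-PL$^*$ on $S$.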
Abstract
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization problems corresponding to such systems are generally not convex, even locally. We argue that instead they satisfy PL$^*$, a variant of the Polyak-Łojasiewicz condition, on most (but not all) of the parameter space, which guarantees both the existence of solutions and efficient optimization by (stochastic) gradient descent (SGD/GD). The PL$^*$ condition of these systems is closely related to the condition number of the tangent kernel associated with a non-linear system, showing how a PL$^*$-based non-linear theory parallels classical analyses of over-parameterized linear equations. We show that wide neural networks satisfy the PL$^*$ condition, which explains the convergence of (S)GD to a global minimum. Finally, we propose a relaxation of the PL$^*$ condition applicable to "almost" over-parameterized systems.
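The relationship between the tangent kernel and gradient descent described in the abstract can be checked numerically on a toy problem. The following is a minimal sketch and not the authors' code: it assumes a square loss $\frac{1}{2}\|F(w)-y\|^2$, a one-hidden-layer tanh model with fixed random output signs, two scalar inputs, and a hand-picked step size. All function names (`model`, `jacobian`, `min_eig_2x2`, `run_gd`) and all numerical values are hypothetical choices made for illustration. At every iterate it records the loss, half the squared gradient norm, and the smallest eigenvalue of the 2x2 tangent kernel $K(w) = J(w)J(w)^T$, which acts as a local PL$^*$ constant for this loss.

```python
import math
import random

def model(w, v, xs):
    """F(w): outputs of a one-hidden-layer tanh network on each input in xs."""
    scale = 1.0 / math.sqrt(len(w))
    return [scale * sum(vj * math.tanh(wj * x) for wj, vj in zip(w, v)) for x in xs]

def jacobian(w, v, xs):
    """J[i][j] = dF_i / dw_j for the model above."""
    scale = 1.0 / math.sqrt(len(w))
    return [[scale * vj * x * (1.0 - math.tanh(wj * x) ** 2) for wj, vj in zip(w, v)]
            for x in xs]

def min_eig_2x2(K):
    """Smallest eigenvalue of a symmetric 2x2 matrix (closed form)."""
    a, b, d = K[0][0], K[0][1], K[1][1]
    return 0.5 * ((a + d) - math.sqrt((a - d) ** 2 + 4.0 * b * b))

def run_gd(m=50, steps=200, lr=0.25, seed=0):
    """Gradient descent on 0.5 * ||F(w) - y||^2 with m parameters, 2 equations.

    Returns a list of (loss, 0.5 * ||grad||^2, lambda_min(K(w))) per iterate.
    The inputs, targets, and step size are illustrative assumptions.
    """
    rng = random.Random(seed)
    xs, ys = [1.0, -0.5], [0.4, -0.3]                 # assumed toy data
    v = [rng.choice([-1.0, 1.0]) for _ in range(m)]   # fixed output signs
    w = [rng.gauss(0.0, 1.0) for _ in range(m)]       # trainable weights
    history = []
    for _ in range(steps):
        r = [f - y for f, y in zip(model(w, v, xs), ys)]   # residual F(w) - y
        J = jacobian(w, v, xs)
        loss = 0.5 * sum(ri * ri for ri in r)
        grad = [sum(J[i][j] * r[i] for i in range(2)) for j in range(m)]  # J^T r
        K = [[sum(J[i][j] * J[k][j] for j in range(m)) for k in range(2)]
             for i in range(2)]                            # tangent kernel J J^T
        history.append((loss, 0.5 * sum(g * g for g in grad), min_eig_2x2(K)))
        w = [wj - lr * gj for wj, gj in zip(w, grad)]      # gradient step
    return history

history = run_gd()
print("initial loss:", history[0][0])
print("final loss:  ", history[-1][0])
print("smallest lambda_min(K) seen:", min(lam for _, _, lam in history))
```

Each recorded triple satisfies `0.5 * ||grad||^2 >= lambda_min(K) * loss` up to rounding, which is the PL$^*$ inequality with the local constant $\lambda_{\min}(K(w))$. How fast the loss shrinks in this sketch depends on how far $\lambda_{\min}(K(w))$ stays from zero along the path, which is the conditioning question the abstract refers to.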
