On non-approximability of zero loss global ${\mathcal L}^2$ minimizers by gradient descent in Deep Learning

Thomas Chen, Patricia Muñoz Ewald

TL;DR

The paper investigates why zero ${\mathcal L}^2$ loss minimizers are generally unavailable for underparametrized ReLU networks under gradient descent. It connects parameter-space dynamics to output-space dynamics through the Jacobian $D[\boldsymbol{\theta}]$ and the neural tangent kernel $D[\boldsymbol{\theta}]D^T[\boldsymbol{\theta}]$, and compares this to a simple gradient-flow model. It shows that in the overparametrized regime ($K\ge QN$), a uniform positive lower bound on the NTK ensures convergence to the global zero-loss solution, whereas in the underparametrized regime ($K<QN$) the dynamics induces a constrained evolution, making zero loss at stationary points generally impossible. Crucially, for generic training-data distributions, zero-loss minimizers do not exist in underparametrized ReLU networks; zero-loss solutions only arise with non-generic (clustered) input distributions, aligning with prior constructive minimizers. These results illuminate how data geometry and network capacity shape optimization dynamics and the feasibility of exact data fitting in deep learning.
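
The link between the two dynamics can be written out in a few lines. This is a sketch under an assumed reading of the notation, not a quotation from the paper: $\boldsymbol{\theta}\in\mathbb{R}^K$ collects the parameters, $\underline{x}[\boldsymbol{\theta}]\in\mathbb{R}^{QN}$ stacks the $Q$-dimensional outputs on the $N$ training inputs, $D[\boldsymbol{\theta}]=\partial\underline{x}/\partial\boldsymbol{\theta}\in\mathbb{R}^{QN\times K}$, ${\mathcal C}$ is the cost, and $s$ is a flow-time variable introduced here for illustration.

$$
\begin{aligned}
\partial_s \boldsymbol{\theta} &= -\nabla_{\boldsymbol{\theta}}\,{\mathcal C}\big[\underline{x}[\boldsymbol{\theta}]\big] = -D^T[\boldsymbol{\theta}]\,\nabla_{\underline{x}}{\mathcal C}, \\
\partial_s \underline{x} &= D[\boldsymbol{\theta}]\,\partial_s \boldsymbol{\theta} = -D[\boldsymbol{\theta}]\,D^T[\boldsymbol{\theta}]\,\nabla_{\underline{x}}{\mathcal C}.
\end{aligned}
$$

A stationary point satisfies $D^T[\boldsymbol{\theta}]\,\nabla_{\underline{x}}{\mathcal C}=0$. This forces $\nabla_{\underline{x}}{\mathcal C}=0$ only if $D^T[\boldsymbol{\theta}]$ is injective, that is, if $D[\boldsymbol{\theta}]$ has rank $QN$, which requires $K\ge QN$. For $K<QN$ the kernel of $D^T[\boldsymbol{\theta}]$ has dimension at least $QN-K$, so stationarity alone does not imply zero loss.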

Abstract

We analyze geometric aspects of the gradient descent algorithm in Deep Learning (DL), and give a detailed discussion of the circumstance that in underparametrized DL networks, zero loss minimization generically cannot be attained. As a consequence, we conclude that the distribution of training inputs must necessarily be non-generic in order to produce zero loss minimizers, both for the method constructed in [Chen-Munoz Ewald 2023, 2024] and for gradient descent [Chen 2025] (which assume clustering of training data).

Paper Structure (7 sections, 3 theorems, 27 equations)

Key Result

Theorem 1.3

Assume that ${\underline{x}}[\underline{\theta}_*]$ is a stationary solution. Then it corresponds to a global minimum of the ${\mathcal{L}}^2$ cost if and only if $\nabla_{{\underline{x}}}{\mathcal{C}}[{\underline{x}}[\underline{\theta}_*]]=0$. A necessary condition for $\nabla_{{\underline{x}}}{\mathcal{C}}[{\underline{x}}[\underline{\theta}_*]]=0$ to follow from eq-statsol-1-0 is that has f
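
The rank obstruction behind statements of this kind can be checked numerically. The following is a minimal sketch, not the authors' construction. It assumes a hypothetical one-hidden-layer ReLU network with toy sizes `M`, `H`, `Q`, `N`, approximates the Jacobian by central finite differences, and uses the unnormalized cost $\tfrac12|\underline{x}-\underline{y}|^2$; the paper's normalization may differ. With $K<QN$, the sketch picks a unit vector in the kernel of $D^T$ and chooses artificial labels so that the output-space gradient equals that vector. The parameter-space gradient then vanishes although the loss does not.

```python
import numpy as np

def unpack(theta, M, H, Q):
    """Split the flat parameter vector into weights and biases."""
    W1, b1, W2, b2 = np.split(theta, np.cumsum([H * M, H, Q * H]))
    return W1.reshape(H, M), b1, W2.reshape(Q, H), b2

def network_outputs(theta, X0, M, H, Q):
    """Stacked outputs in R^{QN} of a toy one-hidden-layer ReLU network."""
    W1, b1, W2, b2 = unpack(theta, M, H, Q)
    hidden = np.maximum(W1 @ X0 + b1[:, None], 0.0)  # shape (H, N)
    out = W2 @ hidden + b2[:, None]                  # shape (Q, N)
    return out.T.reshape(-1)

def jacobian(theta, X0, M, H, Q, eps=1e-6):
    """Central-difference approximation of D = dx/dtheta, shape (QN, K)."""
    n_out = network_outputs(theta, X0, M, H, Q).size
    D = np.empty((n_out, theta.size))
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        plus = network_outputs(theta + step, X0, M, H, Q)
        minus = network_outputs(theta - step, X0, M, H, Q)
        D[:, k] = (plus - minus) / (2 * eps)
    return D

# Hypothetical toy sizes: input dim M, hidden width H, output dim Q, N samples.
M, H, Q, N = 2, 2, 2, 10
K = H * M + H + Q * H + Q          # K = 12 parameters, QN = 20 outputs
rng = np.random.default_rng(0)
theta = rng.normal(size=K)
X0 = rng.normal(size=(M, N))

D = jacobian(theta, X0, M, H, Q)

# The last QN - K columns of the full U are orthogonal to the column space
# of D, so each of them lies in the kernel of D^T.
U, S, Vt = np.linalg.svd(D, full_matrices=True)
v = U[:, -1]

# Artificial labels chosen so that the output-space gradient x - y equals v.
x = network_outputs(theta, X0, M, H, Q)
y = x - v
grad_x = x - y
grad_theta = D.T @ grad_x
cost = 0.5 * np.sum((x - y) ** 2)

print("K =", K, " QN =", Q * N)
print("|grad_theta| =", np.linalg.norm(grad_theta))
print("|grad_x|     =", np.linalg.norm(grad_x))
print("cost         =", cost)
```

The labels here are constructed after the fact to land in the kernel, so the sketch only shows that such stationary points with nonzero loss are possible when $K<QN$; it says nothing about which labels gradient descent can fit.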

Theorems & Definitions (9)

  • Remark 1.1
  • Remark 1.2
  • Theorem 1.3
  • proof
  • Remark 1.4
  • Theorem 1.5
  • proof
  • Theorem 1.6
  • proof