Zero loss guarantees and explicit minimizers for generic overparametrized Deep Learning networks

Thomas Chen, Andrew G. Moore

TL;DR

The paper addresses when highly overparametrized deep networks can achieve zero training loss on generic data and how depth affects training dynamics. It provides explicit constructions of zero-loss minimizers that do not rely on gradient descent, and analyzes the rank loss of the training Jacobian, which can impede gradient-based optimization as depth increases. A central theme is the dichotomy between underparametrized and overparametrized regimes: zero loss is generically attainable in the overparametrized case under mild activation and data conditions, while underparametrized networks require special data structure for exact fitting. These insights clarify the landscape of gradient-flow dynamics in wide networks and connect solvability to the rank properties of the training Jacobian.

Abstract

We determine sufficient conditions for overparametrized deep learning (DL) networks to guarantee the attainability of zero loss in the context of supervised learning, for the $\mathcal{L}^2$ cost and generic training data. We present an explicit construction of the zero loss minimizers without invoking gradient descent. On the other hand, we point out that increasing the depth can deteriorate the efficiency of cost minimization using a gradient descent algorithm, by analyzing the conditions for rank loss of the training Jacobian. Our results clarify key aspects of the dichotomy between zero loss reachability in underparametrized versus overparametrized DL.
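As a rough illustration of reaching zero $\mathcal{L}^2$ cost without gradient descent, the sketch below treats only a single linear layer, which is a deliberate simplification and not the paper's construction for deep networks. It assumes $N$ generic inputs in ${\mathbb R}^M$ with $M>N$, so the $M\times N$ data matrix $X$ has full column rank, and solves $WX=Y$ in closed form via $W = Y(X^TX)^{-1}X^T$. All function names (`matmul`, `solve`, `zero_loss_linear_map`, `l2_cost`) and the dimensions chosen are hypothetical.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def solve(G, B):
    """Solve G Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(G)
    aug = [list(G[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        # Swap in the row with the largest pivot candidate for stability.
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def zero_loss_linear_map(X, Y):
    """X is M x N (columns are inputs), Y is Q x N. Returns W (Q x M) with W X = Y.

    Assumes X has full column rank, which holds for generic data when M > N.
    """
    Xt = transpose(X)
    gram = matmul(Xt, X)         # N x N Gram matrix, invertible by assumption
    left_inv = solve(gram, Xt)   # N x M, equals (X^T X)^{-1} X^T
    return matmul(Y, left_inv)   # Q x M

def l2_cost(W, X, Y):
    """Sum of squared errors between W X and Y."""
    P = matmul(W, X)
    return sum((p - y) ** 2 for pr, yr in zip(P, Y) for p, y in zip(pr, yr))

random.seed(0)
M, N, Q = 6, 4, 2  # input dimension exceeds the number of training samples
X = [[random.gauss(0, 1) for _ in range(N)] for _ in range(M)]
Y = [[random.gauss(0, 1) for _ in range(N)] for _ in range(Q)]
W = zero_loss_linear_map(X, Y)
print(l2_cost(W, X, Y))  # zero up to floating-point rounding
```

The closed form works here only because the inputs are linearly independent; when $N>M$ the system $WX=Y$ is overdetermined and generic targets cannot be fitted exactly by a linear map, which mirrors the underparametrized side of the dichotomy in this toy setting.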

Paper Structure

This paper contains 12 sections, 9 theorems, 27 equations.

Key Result

Theorem 2.1

Assume that $M=M_0>N$, and that all hidden layers have equal dimension, $M_\ell=M$, for all $\ell=1,\dots,L$, and that $Q\leq M$ in the output layer. If we take any map $\sigma : {\mathbb R}^M \longrightarrow {\mathbb R}^M$ which is a local diffeomorphism at at least one point (this includes most co

Theorems & Definitions (27)

  • Theorem 2.1
  • proof
  • Theorem 2.2
  • proof
  • Remark 1
  • Remark 2
  • Definition 3.1
  • Definition 4.1: Strongly Overparameterized
  • Definition 4.2: Broadcast Vectorized Tensor Product
  • Lemma 1
  • ...and 17 more