Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

Thomas Chen; Patrícia Muñoz Ewald

TL;DR

This paper proves an upper bound of order $O(\delta_P)$ on the minimum of the cost function, where $\delta_P$ measures the signal-to-noise ratio of the training data, and, in the special case $M=Q$, explicitly determines an exact degenerate local minimum of the cost function.

Abstract

In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow ReLU networks through the explicit construction of upper bounds which appeal to the structure of classification data, without use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider an ${\mathcal L}^2$ cost function, input space ${\mathbb R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal-to-noise ratio of training data. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.
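To make the setting concrete, the following is a minimal sketch of a shallow ReLU network and a squared-error cost on inputs in ${\mathbb R}^M$ with one-hot targets in ${\mathbb R}^Q$. Several choices here are assumptions rather than details taken from the abstract: the hidden layer has width $M$, the output layer is affine with no activation, and the cost is normalized as $\frac{1}{2N}\sum_k |\cdot|^2$. The paper's own definitions may differ in each of these. The function names `shallow_relu` and `l2_cost` are hypothetical.

```python
import random

def relu(v):
    """Componentwise ReLU of a list of floats."""
    return [max(0.0, t) for t in v]

def affine(W, b, x):
    """Return W x + b for a matrix W (list of rows), bias b, and vector x."""
    return [sum(w * t for w, t in zip(row, x)) + bi for row, bi in zip(W, b)]

def shallow_relu(x, W1, b1, W2, b2):
    """One hidden ReLU layer followed by an affine output layer (assumed form)."""
    return affine(W2, b2, relu(affine(W1, b1, x)))

def l2_cost(inputs, targets, W1, b1, W2, b2):
    """Squared-error cost with the assumed normalization 1/(2N)."""
    total = 0.0
    for x, y in zip(inputs, targets):
        out = shallow_relu(x, W1, b1, W2, b2)
        total += sum((o - t) ** 2 for o, t in zip(out, y))
    return total / (2 * len(inputs))

# Toy setup with M = 3 input dimensions and Q = 2 classes, so Q <= M.
M, Q = 3, 2
rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(M)]  # assumed M x M
b1 = [0.0] * M
W2 = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(Q)]  # assumed Q x M
b2 = [0.0] * Q
inputs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
targets = [[1.0, 0.0], [0.0, 1.0]]  # one-hot class labels in R^Q
cost = l2_cost(inputs, targets, W1, b1, W2, b2)
```

With randomly drawn weights as above, `cost` is simply the starting value that a training procedure would try to reduce; the paper instead constructs weights and biases explicitly and bounds the resulting cost.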

Paper Structure (14 sections, 4 theorems, 203 equations, 2 figures, 1 table)

Introduction
Related work
Related work by the authors
Outline of paper
Definition of the mathematical model
Statement of Main Results
Upper bound on minimum of cost function for M > Q
Exact degenerate local minimum in the case M=Q
Geometric interpretation
Dependence on truncation
Experiments
Proof of Theorem \ref{thm-cC-uppbd-1}
Proof of Theorem \ref{thm-cC-uppbd-2}
Proof of Theorem \ref{thm-DL-geometry-1}

Key Result

Theorem 3.1

Let $Q\leq M \leq QM$. Assume that $R\in O(M)$ diagonalizes $P,P^\perp$, and let $\beta_1\geq 2\max_{j,i}|x_{0,j,i}|$. Let ${\mathcal{C}}[W_i^*,b_i^*]$ be the cost function evaluated for the trained shallow network defined by the following weights and biases, [...] and [...]. Moreover, $b_1^*=P_R b_1^*+P_R^\perp b_1^*$ with [...] for $u_M\in{\mathbb R}^M$ as in eq-uM-def-1-0-0, and [...]. Then, the minimum of the cost [...]

Figures (2)

Figure 1: (Average of) initial ($C_{init}$) and final ($C_{final}$) cost of randomly initialized neural network(s) trained to classify Gaussian mixture data with $Q$ classes, for different (fixed in each plot) cluster standard deviations, plotted against the bound \ref{boundexp} computed for each data set.
Figure 2: (Average of) initial ($C_{init}$) and final ($C_{final}$) cost of randomly initialized neural network(s) trained to classify Gaussian mixture data with $Q$ classes, plotted against the bound \ref{boundexp} computed for each data set.
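The captions above refer to Gaussian mixture data with $Q$ classes and a chosen cluster standard deviation. A minimal sketch of how such a data set could be generated is given below. The cluster centers, the isotropic noise model, the sample sizes, and the helper names `sample_gaussian_mixture` and `class_means` are all hypothetical; the captions do not specify how the authors generated their data, and this sketch does not compute the bound or $\delta_P$.

```python
import random

def sample_gaussian_mixture(centers, std, n_per_class, seed=0):
    """Draw n_per_class points around each center with isotropic Gaussian noise.

    Returns (points, labels), where labels[k] is the class index of points[k].
    """
    rng = random.Random(seed)
    points, labels = [], []
    for j, center in enumerate(centers):
        for _ in range(n_per_class):
            points.append([c + rng.gauss(0.0, std) for c in center])
            labels.append(j)
    return points, labels

def class_means(points, labels, num_classes):
    """Average the points of each class componentwise."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, j in zip(points, labels):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    return [[s / counts[j] for s in sums[j]] for j in range(num_classes)]

# Hypothetical toy values: Q = 3 classes in R^3, centers on the coordinate axes.
centers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points, labels = sample_gaussian_mixture(centers, std=0.1, n_per_class=50)
means = class_means(points, labels, num_classes=len(centers))
```

Increasing `std` spreads each cluster further around its mean, which is the kind of variation the cluster standard deviation in Figure 1 controls.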

Theorems & Definitions (8)

Remark 1.1
Theorem 3.1
Theorem 3.2
Theorem 3.3
Definition 3.4
Theorem 3.5
proof
Remark 3.6
