Approximation and Gradient Descent Training with Neural Networks

G. Welper

TL;DR

Recent work extended a neural tangent kernel (NTK) optimization argument to an under-parametrized regime and proved direct approximation bounds for networks trained by gradient flow. This paper establishes analogous results for networks trained by gradient descent.

Abstract

It is well understood that neural networks with carefully hand-picked weights provide powerful function approximation and that they can be successfully trained in over-parametrized regimes. Since over-parametrization ensures zero training error, these two theories are not immediately compatible. Recent work uses the smoothness that is required for approximation results to extend a neural tangent kernel (NTK) optimization argument to an under-parametrized regime and show direct approximation bounds for networks trained by gradient flow. Since gradient flow is only an idealization of a practical method, this paper establishes analogous results for networks trained by gradient descent.
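
For orientation, the two training methods contrasted above can be written side by side. This is the standard textbook formulation, not a quotation of the paper's setup equations: the loss $L$ and its normalization are assumptions, while $\theta^n$ and the learning rate $\gamma$ follow the notation of the key result.

$$
\frac{d}{dt}\theta(t) = -\nabla L(\theta(t)) \quad \text{(gradient flow)},
\qquad
\theta^{n+1} = \theta^n - \gamma \nabla L(\theta^n) \quad \text{(gradient descent)},
$$

where, as an assumed normalization, $L(\theta) = \tfrac{1}{2}\|f_\theta - f\|_{L_2(D)}^2$. Gradient descent is the explicit Euler discretization of gradient flow with step size $\gamma$, which is why results for the continuous flow do not carry over automatically: the discrete iteration requires a restriction on $\gamma$.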

Paper Structure

This paper contains 27 sections, 16 theorems, and 83 equations.

Key Result

Theorem 2.1

Assume we train the shallow network $f_\theta$, defined in eq:setup:network, with gradient descent eq:setup:gd applied to the $L_2(D)$ loss eq:setup:loss, with learning rate $\gamma \lesssim h \sqrt{m}$ and for some $0 < {s} < 1/2$ and some constant $c_h$ that may depend on the initial error $\|f_{\theta^0} - f\|_0$. Then, with $\kappa^n := f_{\theta^n} - f$ and probability at least $1 - \frac{c}

Theorems & Definitions (26)

  • Theorem 2.1
  • Theorem 2.2
  • Theorem 3.1
  • Lemma 3.2
  • proof
  • Lemma 3.3
  • proof
  • proof: Proof of Theorem \ref{th:general:convergence}
  • Lemma 4.1
  • proof
  • ...and 16 more