Towards optimal hierarchical training of neural networks

Michael Feischl; Alexander Rieder; Fabian Zehetgruber

TL;DR

The paper proposes a hierarchical training algorithm for standard feed-forward neural networks that adaptively extends the network architecture as soon as the optimization reaches a stationary point. It also derives computable indicators that judge the optimality of the training state of a given network.

Abstract

We propose a hierarchical training algorithm for standard feed-forward neural networks that adaptively extends the network architecture as soon as the optimization reaches a stationary point. By solving small (low-dimensional) optimization problems, the extended network provably escapes any local minimum or stationary point. Under some assumptions on the approximability of the data with stable neural networks, we show that the algorithm achieves an optimal convergence rate $s$, in the sense that the loss is bounded, up to a constant, by the number of parameters raised to the power $-s$. As a byproduct, we obtain computable indicators which judge the optimality of the training state of a given network and derive a new notion of generalization error.
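
The grow-when-stationary idea can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes a one-hidden-layer ReLU network on 1D data, a mean-squared-error loss, gradient descent with backtracking as the optimizer, and random candidate neurons whose outgoing weight is chosen by a one-dimensional least-squares problem. All function names (`train_until_stationary`, `extend`, `hierarchical_fit`) and parameter values are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def predict(params, x):
    w, b, a = params                      # hidden weights, biases, output weights
    return relu(np.outer(x, w) + b) @ a   # shape (m,); zeros for an empty network

def loss(params, x, y):
    return 0.5 * np.mean((predict(params, x) - y) ** 2)

def gradient(params, x, y):
    w, b, a = params
    z = np.outer(x, w) + b
    r = relu(z) @ a - y                   # prediction error per sample
    dz = (z > 0) * np.outer(r, a)         # back-propagated error per neuron
    m = x.size
    return dz.T @ x / m, dz.sum(axis=0) / m, relu(z).T @ r / m

def train_until_stationary(params, x, y, lr=0.1, tol=1e-4, max_steps=500):
    """Gradient descent with backtracking; stops when the gradient norm is small."""
    for _ in range(max_steps):
        g = gradient(params, x, y)
        if np.sqrt(sum(np.sum(gi ** 2) for gi in g)) < tol:
            break                         # (approximately) stationary
        current, step = loss(params, x, y), lr
        while step > 1e-8:
            trial = tuple(p - step * gi for p, gi in zip(params, g))
            if loss(trial, x, y) < current:
                params = trial
                break
            step *= 0.5
        else:
            break                         # no decreasing step found
    return params

def extend(params, x, y, rng, n_candidates=20):
    """Add one neuron; its output weight solves a 1D least-squares problem."""
    w, b, a = params
    residual = y - predict(params, x)
    best = None
    for _ in range(n_candidates):         # assumption: random candidate directions
        wc, bc = rng.normal(), rng.normal()
        phi = relu(wc * x + bc)
        denom = phi @ phi
        if denom == 0.0:
            continue                      # neuron inactive on all samples
        ac = (residual @ phi) / denom     # optimal outgoing weight
        decrease = ac ** 2 * denom        # drop in the sum of squared residuals
        if best is None or decrease > best[0]:
            best = (decrease, wc, bc, ac)
    if best is None:
        return params
    _, wc, bc, ac = best
    return np.append(w, wc), np.append(b, bc), np.append(a, ac)

def hierarchical_fit(x, y, n_extensions=5, seed=0):
    rng = np.random.default_rng(seed)
    params = (np.zeros(0), np.zeros(0), np.zeros(0))   # start with no neurons
    history = [loss(params, x, y)]
    for _ in range(n_extensions):
        params = extend(params, x, y, rng)             # grow the architecture
        params = train_until_stationary(params, x, y)  # optimize until stationary
        history.append(loss(params, x, y))
    return params, history

if __name__ == "__main__":
    x = np.linspace(-1.0, 1.0, 50)
    _, history = hierarchical_fit(x, x ** 2)
    print(history)
```

In this sketch the extension step cannot increase the loss, because the new output weight is the exact minimizer of a one-dimensional quadratic with the old network (weight zero) as a feasible point. The paper's guarantees of escaping stationary points and of optimal rates rest on its own construction and stability assumptions, which this toy version does not reproduce.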

Paper Structure (18 sections, 14 theorems, 101 equations, 3 figures, 2 algorithms)

Introduction
Contributions of this work
Other approaches and related work
Notation and Definitions
Loss function and training data
Hierarchical algorithm and main result
Neural network calculus
Quasi-optimal hierarchical training
Numerical stability of neural networks
Scaling laws for (deep) stable networks
Analysis of Algorithm \ref{alg:inner}
A computable bound on the optimality of the loss
Hierarchical training of deeper networks
Partial training of the final layers
Partial training of the first layers
...and 3 more sections

Key Result

Lemma 2.3

Any two network realizations $F,G\in{\mathcal{R}}(\chi)$ satisfy for $\alpha\geq 0$

Figures (3)

Figure 1: Schematic of the stability assumption in \ref{eq:stability0} and Definition \ref{def:Lstable}. The black vectors represent the $\boldsymbol{z}_i$. While the configuration on the right-hand side is okay, the left-hand side configuration violates \ref{eq:stability0} due to cancellation.
Figure 1: Comparison of adaptive and uniform algorithm for $f(x,y)=(x+y)^2$ (left), $f(x,y,z )=(x+y+z)^2/3$ (bottom), and $f(x,y)=(x+y)^{2/3}$ (right). We plot the average error over ten training runs. The dashed line represents $\mathcal{O}(n^{-2})$.
Figure 2: Comparison of adaptive and uniform algorithm for $f(\boldsymbol{x})=(\sum_{i=1}^{10}x_i)^2/10$. We plot the average error over ten training runs. The dashed line represents $\mathcal{O}(n^{-2})$.
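
The captions compare measured errors against a reference rate $\mathcal{O}(n^{-2})$. One common way to read such a rate off data is a least-squares line fit in log-log coordinates. The sketch below assumes error measurements of the form $\text{error} \approx C\, n^{-s}$; the function name `empirical_rate` and the sample numbers are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def empirical_rate(n_params, errors):
    """Fit log(error) = log(C) - s * log(n) and return (s, C)."""
    log_n = np.log(np.asarray(n_params, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    slope, intercept = np.polyfit(log_n, log_e, 1)   # degree-1 least-squares fit
    return -slope, np.exp(intercept)

if __name__ == "__main__":
    # Synthetic data that follows error = 3 * n**(-2) exactly.
    n = np.array([10.0, 20.0, 40.0, 80.0])
    s, C = empirical_rate(n, 3.0 * n ** -2.0)
    print(s, C)
```

A fitted $s$ close to 2 would correspond to a curve running parallel to the dashed reference line in such a plot.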

Theorems & Definitions (42)

Remark 2.1
Lemma 2.3
Proof 1
Lemma 2.4
Proof 2
Lemma 3.1
Proof 3
Remark 3.2
Definition 3.3
Remark 3.4
...and 32 more
