Training a Two Layer ReLU Network Analytically

Adrian Barbu

TL;DR

We explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss. It alternates between the two layers, analytically finding the critical points of the loss function for one layer while keeping the other layer and the neuron activation pattern fixed.

Abstract

Neural networks are usually trained with variants of gradient-descent-based optimization algorithms such as stochastic gradient descent (SGD) or the Adam optimizer. Recent theoretical work states that the critical points (where the gradient of the loss is zero) of two-layer ReLU networks with the square loss are not all local minima. In this work, however, we explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss that alternately finds the critical points of the loss function analytically for one layer while keeping the other layer and the neuron activation pattern fixed. Experiments indicate that this simple algorithm can find deeper optima than SGD or the Adam optimizer, obtaining significantly smaller training loss values on four out of the five real datasets evaluated. Moreover, the method is faster than the gradient descent methods and has virtually no tuning parameters.

Paper Structure

This paper contains 15 sections, 1 theorem, 15 equations, 11 figures, 5 tables, and 1 algorithm.

Key Result

Theorem 1

If the matrix $\mathbf{G}$ from Eq. eq:lossA is fixed, the critical points with respect to $\mathbf{A}$ of the loss function $L(\mathbf{A},\mathbf{B}, \mathbf{b}^0,\mathbf{G})$ from Eq. eq:lossA are solutions of the equation: where $\mathbf{a}=(\mathbf{a}_1^T,\ldots,\mathbf{a}_h^T)^T$ is the matrix $\mathbf{A}$ unraveled into a single column vector.
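The alternating scheme from the abstract, including the fixed-$\mathbf{G}$ step that Theorem 1 addresses, can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. Since Eq. eq:lossA is not reproduced here, the sketch assumes the loss is the mean squared error of the prediction $\mathbf{B}(\mathbf{G}\odot\mathbf{A}\mathbf{x})+\mathbf{b}^0$, that $\mathbf{G}$ holds the slope of $\sigma$ for each sample and neuron (1 if active, $\alpha$ otherwise), and that any hidden-layer bias is absorbed into the inputs. All function names are hypothetical. Under these assumptions the prediction is linear in each layer separately, so each step reduces to a linear least-squares problem.

```python
import numpy as np

def sigma(z, alpha):
    """ReLU-like activation: alpha*z + (1 - alpha)*max(0, z)."""
    return alpha * z + (1.0 - alpha) * np.maximum(0.0, z)

def activation_pattern(A, X, alpha):
    """Slope of sigma per sample and neuron: 1 if active, alpha otherwise."""
    return np.where(X @ A.T > 0, 1.0, alpha)

def fixed_pattern_loss(A, B, b0, G, X, Y):
    """Assumed loss: mean squared error with the pattern G held fixed."""
    pred = (G * (X @ A.T)) @ B.T + b0
    return np.mean(np.sum((pred - Y) ** 2, axis=1))

def solve_output_layer(A, G, X, Y):
    """Least-squares solution for (B, b0) with A and G fixed."""
    H = G * (X @ A.T)                            # hidden outputs, shape (n, h)
    H1 = np.hstack([H, np.ones((len(X), 1))])    # constant column for b0
    W, *_ = np.linalg.lstsq(H1, Y, rcond=None)   # shape (h + 1, k)
    return W[:-1].T, W[-1]

def solve_hidden_layer(B, b0, G, X, Y):
    """Least-squares solution for A with B, b0 and G fixed."""
    n, d = X.shape
    k, h = B.shape
    # pred[j, c] = sum_i B[c, i] * G[j, i] * (A[i] . X[j]) + b0[c],
    # which is linear in the unraveled matrix A.
    M = np.einsum("ci,ji,jd->jcid", B, G, X).reshape(n * k, h * d)
    a, *_ = np.linalg.lstsq(M, (Y - b0).reshape(n * k), rcond=None)
    return a.reshape(h, d)

def alternate(X, Y, h, alpha=0.0, sweeps=10, seed=0):
    """Alternate the two solves, refreshing the pattern before each sweep."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((h, X.shape[1]))
    for _ in range(sweeps):
        G = activation_pattern(A, X, alpha)
        B, b0 = solve_output_layer(A, G, X, Y)
        A = solve_hidden_layer(B, b0, G, X, Y)
    G = activation_pattern(A, X, alpha)
    B, b0 = solve_output_layer(A, G, X, Y)
    return A, B, b0
```

Here `X` has shape `(n, d)` and `Y` has shape `(n, k)`. In this sketch only the fixed-pattern loss is guaranteed not to increase within a step; once the pattern is recomputed from the new $\mathbf{A}$, the actual network loss may differ.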

Figures (11)

  • Figure S1: This work focuses on 2-layer NNs with leaky ReLU-like activation functions $\sigma(x)=\alpha x+ (1-\alpha)\max(0,x)$, with $\alpha\in [0,1)$. Shown are the leaky ReLU ($\alpha=0.1$, left) and the ReLU ($\alpha=0$, right).
  • Figure S2: Left: the binary shape image whose signed distance transform was used to train a shape decoder. Right: the image used for training a denoising autoencoder (DAE).
  • Figure S3: Training MSEs for LBFGS on 100 random splits of the abalone dataset. Left: the training MSEs have many places where LBFGS blows up. Right: considering the minimum train MSE obtained so far at each iteration alleviates the problem.
  • Figure S4: MSE and average $R^2$ vs time (seconds) of 100 runs of training the NN with ReLU activation using the Adam and ANMIN optimizers. Also plotted are the mean test MSEs and $R^2$ with standard deviation.
  • Figure S5: MSE and average $R^2$ vs time (seconds) of 100 runs of training the NN with ReLU activation using the SGD and ANMIN optimizers. Also plotted are the mean test MSEs and $R^2$ with standard deviation.
  • ...and 6 more figures
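
Two of the captions above describe simple computations that a short sketch can make concrete: the activation family $\sigma(x)=\alpha x+(1-\alpha)\max(0,x)$ from Figure S1, and the "minimum train MSE obtained so far" curve from Figure S3. The function names below are hypothetical and the code only illustrates the formulas in the captions, not the paper's implementation.

```python
from itertools import accumulate

def relu_like(x, alpha):
    """sigma(x) = alpha*x + (1 - alpha)*max(0, x), with alpha in [0, 1)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return alpha * x + (1.0 - alpha) * max(0.0, x)

def best_so_far(mse_per_iteration):
    """Running minimum of a sequence of per-iteration training MSEs."""
    return list(accumulate(mse_per_iteration, min))
```

With `alpha=0` the function reduces to the ReLU, and with `alpha=0.1` it gives the leaky ReLU shown in Figure S1. Applying `best_so_far` to a training curve hides isolated spikes, since each entry is the smallest value seen up to that iteration.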

Theorems & Definitions (1)

  • Theorem 1