Leveraging chaotic transients in the training of artificial neural networks

Pedro Jiménez-González, Miguel C. Soriano, Lucas Lacasa

TL;DR

This work explores the dynamics of the neural network trajectory during training at unconventionally large learning rates. For a range of learning rates, GD optimization shifts from a purely exploitation-like algorithm to a regime of exploration-exploitation balance: the network is still capable of learning, but its trajectory shows sensitive dependence on initial conditions.

Abstract

Traditional algorithms for optimizing artificial neural networks on supervised learning tasks are usually exploitation-type relaxational dynamics such as gradient descent (GD). Here, we explore the dynamics of the neural network trajectory during training for unconventionally large learning rates. We show that for a range of learning rates, GD optimization shifts from a purely exploitation-like algorithm to a regime of exploration-exploitation balance: the neural network is still capable of learning, but the trajectory shows sensitive dependence on initial conditions, as characterized by a positive network maximum Lyapunov exponent. Interestingly, the characteristic training time required to reach an acceptable accuracy on the test set reaches a minimum precisely in this learning rate region, further suggesting that one can accelerate the training of artificial neural networks by operating at the onset of chaos. Our results, initially illustrated for the MNIST classification task, qualitatively hold for a range of supervised learning tasks, learning architectures (including both shallow and deep multilayer perceptrons and convolutional neural networks), and other hyperparameters (different activation functions and weight regularisation), and showcase the emergent, constructive role of transient chaotic dynamics in the training of artificial neural networks.

Paper Structure

This paper contains 6 equations, 14 figures.

Figures (14)

  • Figure 1: Training loss trajectory of a neural network on the MNIST dataset for three different learning rates: $\eta=0.01$, $\eta=7.5$ and $\eta=20$.
  • Figure 2: Semi-log plot of the evolution (along training) of the network distance $d(t)$ for pairs of network trajectories with nearby initializations $\Omega(0)$, as a function of the number of epochs $t$, for a shallow MLP with $\tanh()$ activation function trained on MNIST with a large learning rate. $d(t)$ displays a stylized exponential expansion followed by saturation. The slope of the exponential phase corresponds to the local network Lyapunov exponent $\Lambda$ and is indicative of chaotic mixing.
  • Figure 3: Same as Figure 2, but for many different $\epsilon$-balls centered at different initial conditions $\Omega$. Each initial condition leads in principle to a different local network Lyapunov exponent $\Lambda(\Omega)$. In the inset, we display the histogram of local network Lyapunov exponents. The average of this distribution is the estimate of the network MLE $\lambda_{\text{nMLE}}\approx 0.68$.
  • Figure 4: Estimation of the network Maximum Lyapunov Exponent $\lambda_{\text{nMLE}}$ for MLP trajectories as a function of the learning rate $\eta$. Error bars denote $\pm$ one standard deviation of the population of finite local network Lyapunov exponents $\{\Lambda(\Omega)\}$. The onset of sensitivity to initial conditions $\lambda_{\text{nMLE}}>0$ marks the change from a purely exploitation-type optimization to an exploration/exploitation type.
  • Figure 5: Blue diamonds depict $\rho$, the percentage of MLP initializations $\Omega$ leading to training trajectories with positive local Lyapunov exponent $\Lambda(\Omega)>0$ as a function of the learning rate $\eta$. In the same figure, we also plot (red dots) the average training time $\langle \tau \rangle$ (in number of Gradient Descent epochs) needed to reach an accuracy of 90% or larger in the test set. Training is found to be maximally efficient close to the onset of fully-developed sensitivity to initial conditions ($\Lambda(\Omega)>0 \ \forall \Omega$).
  • ...and 9 more figures
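
The captions of Figures 2 to 5 describe a simple estimation pipeline: read the local exponent $\Lambda$ off the slope of $\log d(t)$ during the exponential phase, then average over initial conditions to obtain $\lambda_{\text{nMLE}}$ and count the share $\rho$ of positive exponents. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the fixed fitting window, and the synthetic distance curve are all assumptions made for illustration; the captions do not say how $d(t)$ is defined or how the exponential phase is delimited.

```python
import numpy as np

def local_lyapunov(d, t_start, t_end):
    """Least-squares slope of log d(t) over epochs t_start..t_end-1.

    `d` is a 1-D array of distances between two training trajectories,
    one value per epoch. The window is chosen by hand here (assumption);
    it should cover the exponential phase only, before saturation.
    """
    t = np.arange(t_start, t_end)
    slope, _intercept = np.polyfit(t, np.log(d[t_start:t_end]), 1)
    return slope

def summarize(exponents):
    """Mean, standard deviation, and percentage of positive exponents."""
    exponents = np.asarray(exponents, dtype=float)
    n_mle = exponents.mean()              # estimate of lambda_nMLE
    spread = exponents.std()              # error-bar width as in Figure 4
    rho = 100.0 * np.mean(exponents > 0)  # percentage with Lambda > 0
    return n_mle, spread, rho

# Synthetic stand-in for a measured d(t): exponential growth, then saturation.
epochs = np.arange(40)
d = np.minimum(1e-8 * np.exp(0.68 * epochs), 1.0)

# Fit only the first 20 epochs, which lie inside the exponential phase.
lam_hat = local_lyapunov(d, 0, 20)

# Hypothetical population of local exponents from different initializations.
n_mle, spread, rho = summarize([0.5, 0.7, -0.1, 0.9])
```

With real training runs, `d` would be replaced by the measured distance between two networks started from nearby initializations, and the fitting window would need to be chosen per curve so that the saturated tail is excluded.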