Dynamical stability and chaos in artificial neural network trajectories along training

Kaloyan Danovski, Miguel C. Soriano, Lucas Lacasa

TL;DR

This work treats neural network training as a discrete-time dynamical system in graph (weight) space and analyzes how the choice of learning rate $\eta$ shapes the dynamical and orbital stability of network trajectories. By studying a shallow network on the Iris task, it uncovers a low-$\eta$ regime with non-monotonic, non-chaotic trajectory divergence and evidence for marginal stability due to flat, high-dimensional loss basins. In contrast, larger learning rates reveal an edge-of-stability regime with positive finite-time Lyapunov exponents and non-monotonic loss, and very large rates produce a chaotic-intermittent regime with complex weight dynamics. The findings challenge naive convergence expectations, motivate a cross-disciplinary view combining dynamical-systems tools with ML practice, and suggest further exploration of regularization and architecture-dependent stability across tasks.

Abstract

The process of training an artificial neural network involves iteratively adapting its parameters so as to minimize the error of the network's prediction when confronted with a learning task. This iterative change can be naturally interpreted as a trajectory in network space -- a time series of networks -- and thus the training algorithm (e.g. gradient descent optimization of a suitable loss function) can be interpreted as a dynamical system in graph space. In order to illustrate this interpretation, here we study the dynamical properties of this process by analyzing through this lens the network trajectories of a shallow neural network, and its evolution through learning a simple classification task. We systematically consider different ranges of the learning rate and explore both the dynamical and orbital stability of the resulting network trajectories, finding hints of regular and chaotic behavior depending on the learning rate regime. Our findings are put in contrast to common wisdom on convergence properties of neural networks and dynamical systems theory. This work also contributes to the cross-fertilization of ideas between dynamical systems theory, network theory and machine learning.
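
To make the "training as a dynamical system" reading concrete, the following is a minimal sketch, not the authors' setup. It iterates the gradient-descent map $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ on a toy quadratic loss (an assumption chosen because its stability threshold is known in closed form, $\eta = 2/a_{\max}$) and estimates a finite-time exponent from the distance between a reference and a perturbed trajectory. The names `gd_trajectory`, `finite_time_exponent`, the matrix `A`, and the two learning rates are all hypothetical illustrations.

```python
import numpy as np

def gd_trajectory(theta0, grad, eta, steps):
    """Iterate the map theta_{t+1} = theta_t - eta * grad(theta_t)."""
    traj = [np.asarray(theta0, dtype=float)]
    for _ in range(steps):
        traj.append(traj[-1] - eta * grad(traj[-1]))
    return np.stack(traj)

def finite_time_exponent(ref, pert):
    """Average log growth rate of the distance between two trajectories."""
    d = np.linalg.norm(pert - ref, axis=1)
    return np.log(d[-1] / d[0]) / (len(d) - 1)

# Toy loss L(theta) = 0.5 * theta^T A theta (assumption, not the Iris network).
A = np.diag([1.0, 4.0])

def grad(theta):
    return A @ theta

rng = np.random.default_rng(0)
theta0 = np.array([1.0, 1.0])

# Random perturbation of the initial condition with radius eps.
eps = 1e-8
delta = rng.normal(size=2)
delta *= eps / np.linalg.norm(delta)

exponents = {}
for eta in (0.1, 0.6):  # below and above the threshold 2 / a_max = 0.5
    ref = gd_trajectory(theta0, grad, eta, steps=50)
    pert = gd_trajectory(theta0 + delta, grad, eta, steps=50)
    exponents[eta] = finite_time_exponent(ref, pert)
    print(f"eta={eta}: finite-time exponent = {exponents[eta]:.3f}")
```

In this linear toy case a positive exponent only signals plain instability (the trajectory itself diverges), not chaos; the sketch is meant to show the measurement, not to reproduce the regimes reported for the trained network.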

Paper Structure (13 sections, 13 equations, 18 figures)

Figures (18)

  • Figure 1: The training process of an ANN is depicted as a network trajectory in graph space, where in each iteration of the optimization scheme the network parameters are updated, leading to a decreasing loss function.
  • Figure 2: Illustration of the Iris dataset and difficulty in linearly separating the three classes. Datapoints are shown in the space of two of their four input features, namely "sepal length" and "sepal width". Colors correspond to different classes, while markers show whether the instances were classified correctly or not (marked as 'x' if the prediction was incorrect).
  • Figure 3: Example showing the evolution of the distances between reference and perturbed trajectories, for a perturbation radius $\epsilon=10^{-8}$. Each panel shows results for a different network initial condition and a random set of perturbations. Gray lines are the distances from individual perturbations, and black is the average distance over all (20) perturbations. Overlaid in blue dashed line (right-hand axis) is the loss trajectory of the network plotted for all perturbations (all loss curves coincide).
  • Figure 4: Evolution of distances between perturbations and reference trajectory for a single initial condition and different values of the perturbation range $\epsilon=\{10^{-14},10^{-10},10^{-6},10^{-2}\}$. Random perturbations are sampled separately for each value of $\epsilon$. Gray lines represent individual perturbations and black line is the mean over perturbations. Note the different scales on the distance axis (left-hand side). The loss of perturbations is overlaid in dashed blue line (right-hand axis).
  • Figure 5: (Left panel) Heatmap for one of the weight matrices of a final solution, i.e. $\mathbf{W}_1$ at the final iteration. Color corresponds to difference in loss $\mathcal{L}_w$ after disabling each weight individually (i.e. setting $w_{ji}=0$ only and recalculating loss). Difference is shown relative to baseline loss $\mathcal{L}$ when all weights are kept as is. (Right panel) Relationship between per-weight displacements $\Delta_w$ and the total distance travelled by individual weights $D_w$. Results are for the weight matrices of a single network trajectory.
  • ...and 13 more figures
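
The per-weight quantities named in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the model is a single linear layer with softmax cross-entropy standing in for the shallow network, the data and weight trajectory are random placeholders, the loss difference is taken as the plain difference $\mathcal{L}_w - \mathcal{L}$ (a ratio to $\mathcal{L}$ would be an equally plausible reading of "relative to baseline"), and $\Delta_w$ and $D_w$ are assumed to be the net displacement $|w(T) - w(0)|$ and the summed step sizes $\sum_t |w(t+1) - w(t)|$. All function names are hypothetical.

```python
import numpy as np

def cross_entropy(W, X, y):
    """Mean softmax cross-entropy of a single linear layer (toy stand-in model)."""
    logits = X @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def ablation_map(W, X, y):
    """Loss change after setting each weight w_ji to zero, one at a time."""
    base = cross_entropy(W, X, y)
    diff = np.zeros_like(W)
    for j, i in np.ndindex(*W.shape):
        W_off = W.copy()
        W_off[j, i] = 0.0
        diff[j, i] = cross_entropy(W_off, X, y) - base
    return diff

def displacement_and_distance(weights):
    """weights has shape (T + 1, ...): one weight matrix per training iteration."""
    delta_w = np.abs(weights[-1] - weights[0])             # net displacement
    dist_w = np.abs(np.diff(weights, axis=0)).sum(axis=0)  # total path length
    return delta_w, dist_w

# Placeholder data: 30 samples, 4 features, 3 classes (shapes mimic Iris only).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.integers(0, 3, size=30)
W = rng.normal(size=(3, 4))
W[0, 0] = 0.0  # an already-disabled weight must give zero loss change

diff = ablation_map(W, X, y)

# Placeholder trajectory: a random walk of 100 steps in weight space.
weights = np.cumsum(rng.normal(scale=0.01, size=(101, 3, 4)), axis=0)
delta_w, dist_w = displacement_and_distance(weights)
```

By the triangle inequality, `delta_w` can never exceed `dist_w`; a weight that wanders a long way but ends near its starting point has a large `dist_w` and a small `delta_w`.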