On the weight dynamics of learning networks

Nahal Sharafi; Christoph Martin; Sarah Hallerberg

TL;DR

This work casts the weight dynamics of a three-layer feed-forward network under gradient-based learning as a dynamical system and derives its tangent (Jacobian) operator to enable local stability analysis. By computing finite-time Lyapunov exponents (FTLEs), covariant Lyapunov vectors (CLVs), and Lyapunov exponents (LEs), the study links stability in weight space to training outcomes, showing how initialization and activation choices shape the attractor structure and the final loss $c_f$. Crucially, early stability indicators, especially FTLEs, can predict whether training will converge to a low-loss or a high-loss region, which opens the door to early stopping or reinitialization. The numerical results are demonstrated on one specific regression task, but the framework suggests how a dynamical-systems perspective can inform initialization, activation choice, and monitoring strategies for learning dynamics more generally.

Abstract

Neural networks have become a widely adopted tool for tackling a variety of problems in machine learning and artificial intelligence. In this contribution we use the mathematical framework of local stability analysis to gain a deeper understanding of the learning dynamics of feed-forward neural networks. To this end, we derive equations for the tangent operator of the learning dynamics of three-layer networks learning regression tasks. The results are valid for an arbitrary number of nodes and arbitrary choices of activation functions. Applying the results to a network learning a regression task, we investigate numerically how stability indicators relate to the final training loss. Although the specific results vary with the choice of initial conditions and activation functions, we demonstrate that it is possible to predict the final training loss by monitoring finite-time Lyapunov exponents or covariant Lyapunov vectors during the training process.
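
To make the idea of monitoring finite-time Lyapunov exponents during training concrete, the following is a minimal sketch, not the authors' implementation. It treats one training step as a discrete map on the weight vector, approximates the tangent operator by central finite differences (the paper derives it analytically for three-layer networks), and accumulates growth rates with the standard QR re-orthonormalization method. The function names, the quadratic toy loss, and the learning rate `eta` are assumptions made purely for illustration.

```python
import numpy as np


def numerical_jacobian(step, w, eps=1e-6):
    """Central-difference Jacobian of the update map `step` at weights `w`."""
    n = w.size
    jac = np.empty((n, n))
    for j in range(n):
        dw = np.zeros(n)
        dw[j] = eps
        jac[:, j] = (step(w + dw) - step(w - dw)) / (2.0 * eps)
    return jac


def finite_time_lyapunov_exponents(step, w0, n_steps, eps=1e-6):
    """Estimate FTLEs of the discrete map w -> step(w) over n_steps iterations."""
    w = np.asarray(w0, dtype=float)
    q = np.eye(w.size)              # orthonormal basis of tangent vectors
    log_growth = np.zeros(w.size)   # accumulated log stretching factors
    for _ in range(n_steps):
        jac = numerical_jacobian(step, w, eps)
        # Push the basis through the tangent map, then re-orthonormalize.
        q, r = np.linalg.qr(jac @ q)
        log_growth += np.log(np.abs(np.diag(r)))
        w = step(w)
    return log_growth / n_steps


# Hypothetical toy example: gradient descent on the quadratic loss
# L(w) = 0.5 * w^T A w, standing in for a real network's training step.
A = np.diag([1.0, 4.0])
eta = 0.1


def quadratic_gd_step(w):
    """One gradient-descent update for the toy quadratic loss."""
    return w - eta * (A @ w)


ftle = finite_time_lyapunov_exponents(quadratic_gd_step, np.array([1.0, -2.0]), 50)
print(ftle)
```

For this linear toy map the tangent operator is the constant matrix $I - \eta A$, so the estimated exponents reduce to $\log|1 - \eta \lambda_i|$ for the eigenvalues $\lambda_i$ of $A$. Both are negative here, which reflects contraction toward the minimum. For an actual network, `step` would be replaced by the real weight update, and the finite-difference Jacobian by an analytic or automatically differentiated one.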

Paper Structure (10 sections, 8 equations, 15 figures, 1 table)


Introduction
Network model and weight dynamics
Deriving the Jacobian of the training dynamics
Numerical results for a specific regression task
Characterizing the training process through indicators of local stability
Predicting outcomes of training processes by monitoring local stability
Conclusions
Appendix
Relations of LEs and $c_f$ for Wide-Range Initialization
Example for an ROC-curve

Figures (15)

Figure 1: Visualization of a three-layer network with arbitrarily many nodes in the hidden layer. The regression task we present to this network requires two input nodes and one output node.
Figure 2: Visualization of the two-dimensional training data used in this paper. The quadratic relationship between the variables is discernible in the figure.
Figure 3: Distributions of final losses $c_f$ (values of the cost function at the end of a training run) for 8000 realizations with He initialization exhibit distinct clusters. (a) Final loss values of 4000 network realizations using ReLU as activation function. (b) Final loss values of 4000 network realizations using tanh as activation function.
Figure 4: Wide-range initialization of network weights creates a wider distribution of training results, allowing non-optimal training results to occur more frequently. Distributions of final losses $c_f$ (values of the cost function at the end of a training run) of 8000 realizations with wide-range random initialization, $\sigma = 20$.
Figure 5: Distributions of Lyapunov exponents computed for 8000 training runs of networks initialized with He initialization, amplifying input with (a) ReLU as activation function and (b) tanh as activation function.
...and 10 more figures
