Network Dynamics-Based Framework for Understanding Deep Neural Networks

Yuchen Lin, Yong Zhang, Sihan Feng, Hong Zhao

TL;DR

This work proposes a dynamical-systems framework for deep learning centered on two neuron-level transformation modes: order-preserving transformations (OPT) and non-order-preserving transformations (NPT). It introduces the Rank Probability Distribution (RPD) and the Linear Substitution Map (L-map) to quantify layer-wise nonlinearity and linearization, and defines attraction basins in both sample and weight spaces to assess robustness and stability. Through analyses of shallow networks and deeper DNNs, the study links the OPT/NPT balance and basin dynamics to learning phases, depth and width effects, and phenomena such as grokking, and shows that batch normalization (BN) and training strategies shape phase transitions. The framework offers insights for architecture design, initialization schemes, and training protocols aimed at improving generalization and stability.

Abstract

Advancements in artificial intelligence call for a deeper understanding of the fundamental mechanisms underlying deep learning. In this work, we propose a theoretical framework to analyze learning dynamics through the lens of dynamical systems theory. We redefine the notions of linearity and nonlinearity in neural networks by introducing two fundamental transformation units at the neuron level: order-preserving transformations and non-order-preserving transformations. Different transformation modes lead to distinct collective behaviors in weight vector organization, different modes of information extraction, and the emergence of qualitatively different learning phases. Transitions between these phases may occur during training, accounting for key phenomena such as grokking. To further characterize generalization and structural stability, we introduce the concept of attraction basins in both sample and weight spaces. The distribution of neurons with different transformation modes across layers, along with the structural characteristics of the two types of attraction basins, forms a set of core metrics for analyzing the performance of learning models. Hyperparameters such as depth, width, learning rate, and batch size act as control variables for fine-tuning these metrics. Our framework not only sheds light on the intrinsic advantages of deep learning, but also provides a novel perspective for optimizing network architectures and training strategies.
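
The distinction between the two transformation modes can be illustrated with a small numerical sketch. It assumes that "order-preserving" means the outputs rank the samples in the same way as their local fields $h = w \cdot x$; the helper `preserves_order`, the five sample values, and the two-neuron "bump" are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def preserves_order(h, y):
    """Return True if outputs y rank the samples the same way as local fields h."""
    return bool(np.array_equal(np.argsort(h, kind="stable"),
                               np.argsort(y, kind="stable")))

# Local fields of five samples along one weight vector (illustrative values).
h = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# A single neuron with a monotonic activation keeps the sample order (OPT-like).
single = np.tanh(h)

# Two cooperating neurons with opposite-signed output weights form a bump,
# so the samples with the largest local fields no longer have the largest
# outputs (NPT-like).
bump = np.tanh(h + 1.0) - np.tanh(h - 1.0)

opt_like = preserves_order(h, single)
npt_like = not preserves_order(h, bump)
print(opt_like, npt_like)
```

In this toy setting the fourth and fifth samples end up with smaller outputs than the third, which is the kind of reordering a single monotonic neuron cannot produce.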

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures.

Figures (7)

  • Figure 1: Illustration of transformation modes and their effects. Circled numbers represent the local fields of five samples projected by a weight vector; hollow circles represent neurons. (a) The OPT mode preserves sample order and can be achieved by a single neuron. (b) The NPT mode disrupts the order: samples 4 and 5 yield smaller outputs than sample 3. This requires at least two cooperating neurons under typical monotonic nonlinear activations. (c) OPT-induced weight vectors concentrate to maximize outputs for sample set A, yielding higher projections than for set B. (d) NPT-induced weight vectors are isotropic. Together, (c) and (d) show how the RPD reflects the composition of transformation modes.
  • Figure 2: Learning dynamics of a shallow network. (a) Test accuracy versus number of training samples for three models: the tanh network, the LNN with $f(h)=h$, and the L-map (the same architecture with tanh replaced by $f(h)=h$). (b) Test accuracy versus training epochs using the full training set. Curve ordering matches (a). (c), (d) RPDs of the tanh network and the LNN after training on 600 samples and on 60,000 samples, respectively. (e) RPDs of the tanh network at selected training epochs, showing their evolution over time.
  • Figure 3: Training dynamics and RPD analysis. (a)–(c) Evolution of training and test accuracy, together with RPD gradients, in a 10-layer DNN (width 512) trained with SGD. Panels (a) and (b) show results with ReLU activation and learning rates of 0.03 and 0.37, respectively, while panel (c) shows results with linear activation and learning rate 0.03. (d)–(f) Corresponding results obtained using the Adam optimizer. The batch size is 60,000 in (d) and (f), and 20,000 in (e). (g) Results for a 23-layer DNN (width 128) trained with Adam. (h) L-map pruning accuracy as a function of the starting layer for 10-layer and 23-layer networks. The x-axis starts from layer index 2 because index 1 denotes the input layer, and index 2 denotes the first hidden layer. (i) Test accuracy as a function of depth for various widths. (j) First-layer RPD gradient as a function of depth and width.
  • Figure 4: Attraction-basin analysis. (a) Accuracy of noisy training samples vs. noise amplitude. (b) Accuracy of training samples vs. noise amplitude under weight perturbations. (c), (d) Average attraction-basin sizes in sample and weight space, respectively, as a function of network width for various depths. (e) Basin sizes in both sample and weight spaces as a function of learning rate. (f) Basin sizes in both sample and weight spaces as a function of batch size. In (e) and (f), sample-space basin sizes are scaled by a factor of 2 for clarity.
  • Figure 5: Sample attraction basins and class-wise RPD curves for the 23-layer DNN in Fig. 3(g). (a) Accuracy of noisy samples from ten classes versus noise amplitude. (b) Class-wise RPD curves at layer 2 (the first hidden layer). (c) Sample attraction basins in a 3D PCA projection of 20 digits. Digits with small attraction basins are more vulnerable to perturbations.
  • ...and 2 more figures
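
Figure 4(a) plots accuracy on noise-perturbed training samples against noise amplitude. A rough sketch of such a probe is given below. It rests on several assumptions that are not from the paper: the noise is Gaussian, a fixed random linear classifier stands in for a trained network, and agreement with the clean predictions replaces labeled accuracy. The names `predict` and `noise_sweep` are hypothetical, and the paper's actual noise model and basin-size definition are not stated in this summary.

```python
import numpy as np

def predict(W, X):
    """Class prediction of a linear stand-in model: argmax over output units."""
    return np.argmax(X @ W, axis=1)

def noise_sweep(W, X, amplitudes, rng):
    """Fraction of samples whose prediction is unchanged at each noise amplitude."""
    clean = predict(W, X)
    curve = []
    for a in amplitudes:
        # Assumed noise model: i.i.d. Gaussian perturbation scaled by amplitude a.
        noisy = X + a * rng.standard_normal(X.shape)
        curve.append(float(np.mean(predict(W, noisy) == clean)))
    return curve

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 3))    # stand-in weights, not a trained model
X = rng.standard_normal((200, 20))  # stand-in samples, not real data
amplitudes = [0.0, 0.5, 1.0, 2.0]
curve = noise_sweep(W, X, amplitudes, rng)
print(curve)
```

The amplitude at which such a curve starts to fall could serve as a rough proxy for the size of the sample-space basin. An analogous sweep that perturbs `W` instead of `X` would correspond to the weight-space probe in Figure 4(b).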