A simple theory for training response of deep neural networks

Kenichi Nakazato

TL;DR

The paper studies training dynamics in deep networks by proposing a minimal toy model with a single hidden layer trained by SGD. It shows that training responses can exhibit aging with a near-constant interaction kernel, reproducing power-law-like decay observed in more complex systems. The results reveal how activation functions and stochastic training drive feature-space reduction and potential network fragility, offering a principled explanation for generalization-versus-robustness phenomena. This simple framework provides intuition for when NTK-like behavior holds and how nonlinear effects reshape learning dynamics, with implications for designing robust training regimes.

Abstract

Deep neural networks give us a powerful way to model the relationship between input and output in a training dataset. We can regard such a network as a complex adaptive system consisting of many artificial neurons that together work as an adaptive memory. The network's behavior is governed by training dynamics with a feedback loop from the evaluation of the loss function. We already know that the training response can be constant or show power-law-like aging in some ideal situations. However, gaps remain between those findings and other complex phenomena, such as network fragility. To fill these gaps, we introduce a very simple network and analyze it. We show that the training response consists of several distinct factors that depend on the training stage, the activation function, or the training method. In addition, we show that stochastic training dynamics reduce the feature space, which can result in network fragility. Finally, we discuss some complex phenomena of deep networks.

Paper Structure (6 sections, 13 equations, 7 figures)


Figures (7)

  • Figure 1: Training response with ReLU. The decay of the training response and response kernel are shown on the left and right, respectively. On the left, the training response, $\Delta(\bm{x}_o,\bm{x}_o)$, is plotted against the training epoch. On the right, the response kernel is plotted against the Hamming distance, $|\bm{x}_o-\bm{x}|$. The input size is 8 and the size of the feature is 128.
  • Figure 2: Training response with ELU. The settings are the same as in Figure 1.
  • Figure 3: Training responses with noise. The training setting is the same as in Figure 1. The upper plots show the results of training with ReLU; the lower plots show the results with ELU.
  • Figure 4: Training responses for regression. The setting is the same as in Figure 1, except that the training data point, $(\bm{x}_o,y_o)$, takes different target values, $0\leq y_o\leq 1$.
  • Figure 5: Training responses with a simplified system. On the left, we show results with a target value, $y_o=1$, and different training rates, $\eta$, and slope parameters, $a$. On the right, we show results with two target values, $0<y_o<1.0$, with the training rate and slope parameter fixed at $\eta=1.0$ and $a=1.0$.
  • ...and 2 more figures
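
The kind of measurement described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the paper's code, and several choices are assumptions made only for illustration: the training response $\Delta(\bm{x}_o,\bm{x})$ is taken to be the change of the network output at $\bm{x}$ caused by one SGD step on a single data point $(\bm{x}_o,y_o)$, the loss is the squared error, inputs are $\{0,1\}$ vectors, the weights are Gaussian-initialized, and the learning rate is 0.005. All function names (`init_params`, `training_response`, `flip_bits`, `run`) are hypothetical. Only the input size of 8, the feature size of 128, and the ReLU activation come from the caption.

```python
import math
import random


def init_params(n_in=8, n_hidden=128, seed=0):
    """Gaussian initialization (an assumption, not taken from the paper)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_hidden)]
    v = [rng.gauss(0.0, 1.0 / math.sqrt(n_hidden)) for _ in range(n_hidden)]
    return W, v


def forward(params, x):
    """Return the scalar output and the hidden pre-activations for input x."""
    W, v = params
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    out = sum(vj * max(0.0, p) for vj, p in zip(v, pre))  # ReLU features
    return out, pre


def sgd_step(params, x_o, y_o, lr):
    """One SGD step on the assumed squared loss 0.5 * (f(x_o) - y_o)**2.

    Returns new parameters; the input parameters are left unchanged.
    """
    W, v = params
    out, pre = forward(params, x_o)
    err = out - y_o
    new_v = [vj - lr * err * max(0.0, p) for vj, p in zip(v, pre)]
    new_W = [[w - lr * err * vj * (1.0 if p > 0.0 else 0.0) * xi
              for w, xi in zip(row, x_o)]
             for row, vj, p in zip(W, v, pre)]
    return new_W, new_v


def training_response(params, x_o, y_o, x, lr):
    """Assumed definition: change of f(x) after one SGD step on (x_o, y_o)."""
    stepped = sgd_step(params, x_o, y_o, lr)
    return forward(stepped, x)[0] - forward(params, x)[0]


def flip_bits(x, k):
    """Return a copy of the binary vector x with its first k bits flipped."""
    return [1 - xi if i < k else xi for i, xi in enumerate(x)]


def run(epochs=30, lr=0.005):
    """Train on a single point and record the self-response at every epoch."""
    params = init_params()
    x_o, y_o = [1] * 8, 1.0
    # Response kernel at initialization, indexed by Hamming distance from x_o.
    kernel = [training_response(params, x_o, y_o, flip_bits(x_o, k), lr)
              for k in range(len(x_o) + 1)]
    self_response = []
    for _ in range(epochs):
        self_response.append(training_response(params, x_o, y_o, x_o, lr))
        params = sgd_step(params, x_o, y_o, lr)
    return self_response, kernel
```

Calling `run()` returns two lists: the self-response $\Delta(\bm{x}_o,\bm{x}_o)$ per epoch, and the response at points of increasing Hamming distance from $\bm{x}_o$. Plotting the first against the epoch and the second against the distance gives a rough analogue of the two panels in Figure 1, under the assumptions stated above.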