On the Learning Dynamics of Deep Neural Networks

Remi Tachet, Mohammad Pezeshki, Samira Shabanian, Aaron Courville, Yoshua Bengio

TL;DR

The paper analyzes the learning dynamics of deep nonlinear networks for binary classification under strong separability assumptions, revealing that learning occurs in independent modes and follows sigmoidal curves with depth accelerating convergence. It contrasts cross-entropy and hinge losses, showing hinge loss yields faster, margin-driven learning and can improve generative-adversarial training signals. A key phenomenon, gradient starvation, explains how frequent features can bottleneck learning of rarer but informative features, with empirical validation on image data. The work also develops phase-diagram insights to characterize initialization regions leading to success or failure and extends the analysis to deeper architectures and multiclass extensions under structured assumptions. Overall, the findings offer theoretical grounding for observed training dynamics, implicit regularization, and loss-function effects in deep learning, while highlighting the limitations imposed by strong simplifying assumptions.

Abstract

While a lot of progress has been made in recent years, the dynamics of learning in deep nonlinear neural networks remain to this day largely misunderstood. In this work, we study the case of binary classification and prove various properties of learning in such networks under strong assumptions such as linear separability of the data. Extending existing results from the linear case, we confirm empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures. We show that given proper initialization, learning expounds parallel independent modes and that certain regions of parameter space might lead to failed training. We also demonstrate that input norm and features' frequency in the dataset lead to distinct convergence speeds which might shed some light on the generalization capabilities of deep neural networks. We provide a comparison between the dynamics of learning with cross-entropy and hinge losses, which could prove useful to understand recent progress in the training of generative adversarial networks. Finally, we identify a phenomenon that we baptize gradient starvation where the most frequent features in a dataset prevent the learning of other less frequent but equally informative features.

Paper Structure

This paper contains 24 sections, 8 theorems, 54 equations, 10 figures, 2 tables.

Key Result

Lemma 3.1

For any $k \in \{1, 2\}$, $x \in D_k$ and $t \geq 0$, the only non-negative elements of $W_t x$ are the ones with an index $i \in \mathcal{I}_k$. The signs of the coordinates of $Z_t$ remain the same throughout training.

Figures (10)

  • Figure 1: Network Architecture
  • Figure 2: Left. Phase diagram representing the dynamics of learning for the couple $(z_t, y_t)$ depending on its initialization. $y_t$ is the value for the class considered, in which all examples have lined up. Each couple lives on a hyperbola. The slope of the linear curves is equal to $\pm\|x\|$ (set to $0.7$ in this diagram). The green region represents the initializations of $(y, z)$ where the classification task will be solved by the network. In the red region, learning does not start (the neuron is inactive at the beginning of training) or collapses as the neuron dies off when $y$ reaches $0$. The $c_i$ points show the three cases from Section \ref{phase}. Right. $y_t$ and $P_t(x \in D_1) = \sigma(z_t y_t)$ for different values of $c$ and $\|x\|$.
  • Figure 3: Left. Solutions of (\ref{ode_system}) for different initializations and $c = 1$. Right. Values of $\alpha_t$, $\beta_t$ and $P_t$, the confidence of the classifier on an example from class $D_1$, for three different initializations. The "full" curves correspond to $(\alpha_0, \beta_0, z_0) = (0.1, 0.1, 0.1)$, i.e. a trajectory in the green region where $\beta_t$ reaches $0$ (orange curve). The confidence on class $D_1$ tends to $1$ (green curve). The "dashed" curves correspond to $(\alpha_0, \beta_0, z_0) = (0.2, 0.9, 0.2)$, i.e. a trajectory in the yellow region, corresponding to $\alpha_t$ reaching $0$ (red curve). The confidence on $D_1$ goes to $0.5$ in that case (brown curve), and the confidence on class $D_2$ goes to $1$ (not shown). The "dash-dotted" curves correspond to $(\alpha_0, \beta_0, z_0) = (1, 0, -1.1)$ and are an instance of the aforementioned failure mode: $\alpha_t$ (or equivalently $y_t$) tends to $0$ (pink curve), $\beta_t$ (not shown) remains $0$ and $P_t$ tends to $0.5$ (grey curve).
  • Figure 4: Logit $u(t)$ and confidence $P_t$ for different numbers of layers and values of $\|x\|$.
  • Figure 5: Left. The three figures on the left are the result of training a generative adversarial network on 8 Gaussians (see Appendix \ref{app-E} for details on the experiment). The samples from the hinge loss are incomparably better. Right. Comparison between hinge loss and binary cross-entropy: training time required to reach a confidence $\delta$ on the classification problem. Subplot: Solutions of Eqs. \ref{eq-main:proba} and \ref{eq-main:res_hinge}.
  • ...and 5 more figures
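
The phase-diagram behaviour described in the Figure 2 caption can be illustrated with a small numerical sketch. It is not the paper's system of equations, which this page does not reproduce. It assumes a hypothetical toy model: one ReLU unit with pre-activation $y = w \cdot x$, an output weight $z$, confidence $P = \sigma(z \max(y, 0))$, and cross-entropy gradient flow on a single example of class $D_1$. Under those assumptions, $y^2 - \|x\|^2 z^2$ is conserved, so each trajectory lies on a hyperbola whose asymptotes have slope $\pm\|x\|$. The names `simulate`, `confidence` and `invariant`, the step size and the initial values are all illustrative choices.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def confidence(z, y):
    # Assumed toy model: a single ReLU unit, so the logit is z * max(y, 0).
    return sigmoid(z * max(y, 0.0))


def invariant(z, y, x_norm=0.7):
    # Quantity conserved by the assumed continuous-time flow (a hyperbola).
    return y ** 2 - (x_norm * z) ** 2


def simulate(z0, y0, x_norm=0.7, dt=1e-3, steps=20000):
    """Forward-Euler integration of the assumed gradient flow.

    Returns a list of (z, y, confidence) tuples. Integration stops early
    once y <= 0, i.e. once the ReLU unit is inactive and receives no gradient.
    """
    z, y = z0, y0
    traj = [(z, y, confidence(z, y))]
    for _ in range(steps):
        if y <= 0.0:
            break  # dead neuron: learning has collapsed
        err = 1.0 - confidence(z, y)   # cross-entropy error signal
        dz = y * err                   # dz/dt = y (1 - P)
        dy = x_norm ** 2 * z * err     # dy/dt = ||x||^2 z (1 - P)
        z, y = z + dt * dz, y + dt * dy
        traj.append((z, y, confidence(z, y)))
    return traj


if __name__ == "__main__":
    for z0, y0 in [(0.1, 0.5), (-1.0, 0.5)]:
        z, y, p = simulate(z0, y0)[-1]
        print(f"init=({z0}, {y0}) -> z={z:.3f}, y={y:.3f}, P={p:.3f}")
```

In this sketch, an initialization with $z_0 > 0$ and $y_0 > 0$ drives the confidence toward $1$ along a sigmoid-like curve. An initialization with $y_0^2 < \|x\|^2 z_0^2$ and $z_0 < 0$ pushes $y$ to $0$, after which the unit is dead and the confidence stays at $0.5$. This matches the qualitative failure case in the caption.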

Theorems & Definitions (17)

  • Lemma 3.1
  • Theorem 3.2
  • proof
  • Corollary 3.3
  • Theorem 3.4
  • Theorem 5.1
  • proof
  • Theorem 6.1
  • proof
  • proof
  • ...and 7 more