Phase Diagram of Dropout for Two-Layer Neural Networks in the Mean-Field Regime

Lénaïc Chizat, Pierre Marion, Yerkin Yesbay

TL;DR

This work analyzes dropout in two-layer neural networks under mean-field scaling to uncover how training dynamics evolve as the width grows. By decomposing dropout into propagation noise, random-metric updates, and a bias dropout penalty, the authors classify the limiting behavior into four nondegenerate regimes (discrete jump, Wasserstein gradient flow with penalization, continuous-time jump, and a critical regime) and provide explicit limit equations for each. The results reveal that the classic dropout penalty is only effective at impractically small learning rates, while at larger rates dropout acts like a random metric or a penalized gradient flow, with a time-averaged NTK in some cases. The analysis combines mean-field particle techniques, stochastic approximation theory, and coupling methods to establish convergence in path and distribution spaces, laying groundwork for a deeper theoretical understanding of dropout in large-scale networks. These findings have implications for modeling training dynamics in wide networks and for designing practical alternatives that emulate dropout effects without slow learning.

Abstract

Dropout is a standard training technique for neural networks that consists of randomly deactivating units at each step of their gradient-based training. It is known to improve performance in many settings, including in the large-scale training of language or vision models. As a first step towards understanding the role of dropout in large neural networks, we study the large-width asymptotics of gradient descent with dropout on two-layer neural networks with the mean-field initialization scale. We obtain a rich asymptotic phase diagram that exhibits five distinct nondegenerate phases depending on the relative magnitudes of the dropout rate, the learning rate, and the width. Notably, we find that the well-studied "penalty" effect of dropout only persists in the limit with impractically small learning rates of order $O(1/\text{width})$. For larger learning rates, this effect disappears and in the limit, dropout is equivalent to a "random geometry" technique, where the gradients are thinned randomly after the forward and backward passes have been computed. In this asymptotic regime, the limit is described by a mean-field jump process where the neurons' update times follow independent Poisson or Bernoulli clocks (depending on whether the learning rate vanishes or not). For some of the phases, we obtain a description of the limit dynamics both in path-space and in distribution-space. The convergence proofs involve a mix of tools from mean-field particle systems and stochastic processes. Together, our results lay the groundwork for a renewed theoretical understanding of dropout in large-scale neural networks.
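
To make the contrast between dropout and the "random geometry" technique concrete, here is a minimal NumPy sketch of one gradient step of each, using the same mask for both. It is an illustration under stated assumptions, not the paper's exact setup: the network $f(x)=\frac{1}{n}\sum_i a_i\tanh(w_i\cdot x)$ with fixed output weights, the quadratic loss, the inverted-dropout factor $1/q$ with $q$ read as a keep probability, and the names `scaled_grads`, `dropout_step` and `ram_step` are all hypothetical choices made for this example.

```python
import numpy as np

def scaled_grads(W, a, X, y, mask, q):
    """Per-neuron gradients of the quadratic loss, multiplied by the width n.

    Assumed model: f(x) = (1/n) * sum_i (mask_i / q) * a_i * tanh(w_i . x).
    """
    n, m = W.shape[0], X.shape[0]
    H = np.tanh(X @ W.T)              # (m, n) hidden activations
    coef = mask * a / q               # (n,) masked, rescaled output weights
    resid = H @ coef / n - y          # (m,) residuals of the masked model
    return ((resid[:, None] * (1.0 - H**2)) * coef).T @ X / m   # (n, d)

def dropout_step(W, a, X, y, lr, mask, q):
    """Dropout: the mask enters the forward and the backward pass."""
    return W - lr * scaled_grads(W, a, X, y, mask, q)

def ram_step(W, a, X, y, lr, mask, q):
    """Random geometry: full-model gradients, thinned by the mask afterwards."""
    full = scaled_grads(W, a, X, y, np.ones(W.shape[0]), 1.0)
    return W - lr * (mask / q)[:, None] * full

# Toy data and coupled randomness (same initialization, same mask).
rng = np.random.default_rng(0)
n, d, m, q, lr = 2000, 3, 64, 0.5, 0.1
X = rng.normal(size=(m, d))
y = np.tanh(X @ rng.normal(size=d))
W0 = rng.normal(size=(n, d))
a = rng.choice([-1.0, 1.0], size=n)
mask = (rng.random(n) < q).astype(float)

W_drop = dropout_step(W0, a, X, y, lr, mask, q)
W_ram = ram_step(W0, a, X, y, lr, mask, q)
```

In both variants a dropped neuron is left untouched. The only difference is the residual used for the kept neurons: the masked model's residual for dropout, the full model's residual for random geometry. Comparing `W_drop` and `W_ram` for increasing `n` is one way to explore how close the two updates are at finite width.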

Paper Structure

This paper contains 41 sections, 19 theorems, 148 equations, 6 figures, 2 tables.

Key Result

Theorem 1

Let Assumptions \ref{ass:phi} and \ref{ass:nondegenerate} hold. Consider the sequence of GD-dropout dynamics \eqref{eq:GD-dropout-weights} with $\mu_0\in \mathcal{P}_1(\mathbb{R}^p)$. Then the limiting dynamics can be classified into four cases depending on the scaling of the hyperparameters (HP):

Figures (6)

  • Figure 1: Phase diagram of dropout in two-layer NN with mean-field scaling. For an HP scaling $(n,\tau_n^{-1},q_n^{-1})$, the limit of $(\log n, -\log \tau_n , -\log q_n)/S_n$ where $S_n= \log n -\log \tau_n -\log q_n$ (when it exists) forms a point in the $2$-simplex, which is represented in the triangle in barycentric coordinates. For instance, the central red point represents proportional limits while the blue line corresponds to $\tau_n^{-1}$ and $q_n^{-1}$ diverging proportionally while $n$ diverges faster. Red area: degenerate limits with either $\alpha=+\infty$ or $\beta=+\infty$. Grey area: Wasserstein gradient flow limit (the effect of dropout disappears in the limit). Orange vertex: discrete-time jump process limit (if $\tau,q>0$). Blue line: continuous-time jump process limit. Red vertex: critical limit. Green line: penalized Wasserstein gradient flow limit.
  • Figure 2: Illustration of the pathwise convergence between random metric (RaM) and dropout dynamics (Proposition 3). We train a width-5000 two-layer NN on a synthetic teacher-student task with the quadratic loss, either with GD (orange), GD with dropout (blue) or GD with RaM (green), with coupled randomness (of initialization and masks $(\eta^i_k)$). The paths shown correspond to two-dimensional projections of the trajectory of two randomly chosen neurons. The pathwise similarity between dropout and RaM illustrates Proposition 3 (the similarity degrades at large times because the width is finite). More details and additional plots are given in Appendix \ref{subsec:exp-details-trajs}.
  • Figure 3: Comparison of several variants of dropout (see details in text) for training two-layer NNs on two classes of MNIST with the logistic loss, as a function of training steps (left) and of keep probability (right). More details are given in Appendix \ref{subsec:exp-details-mnist}.
  • Figure 4: Illustration of the pathwise convergence between random metric (RaM) and dropout dynamics (Proposition 3) for the teacher-student experiment. Same realization as in Figure 2, adding GD with propagation noise and random geometry (PN + RaM).
  • Figure 5: RMS distance in parameter space between the dropout algorithm and the three other tested algorithms (plain GD, GD with random geometry and GD with propagation noise and random geometry) for the teacher-student experiment.
  • ...and 1 more figure
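
The barycentric coordinates used in Figure 1 can be computed directly from a hyperparameter triple. The sketch below evaluates $(\log n, -\log \tau_n, -\log q_n)/S_n$ at a single finite $n$, whereas the caption refers to the limit of this quantity along a sequence. The function name `simplex_point` is hypothetical, and the restriction to $n>1$ and $\tau, q \in (0,1]$ is an assumption made here so that all three coordinates are nonnegative and $S_n>0$.

```python
import math

def simplex_point(n, tau, q):
    """Return (log n, -log tau, -log q) / S with S = log n - log tau - log q.

    Assumes n > 1 and 0 < tau, q <= 1, so the result lies in the 2-simplex.
    """
    if n <= 1 or not (0 < tau <= 1) or not (0 < q <= 1):
        raise ValueError("expected n > 1 and tau, q in (0, 1]")
    logs = (math.log(n), -math.log(tau), -math.log(q))
    s = sum(logs)
    return tuple(v / s for v in logs)

# n, 1/tau and 1/q all equal: the point sits at the center of the triangle.
center = simplex_point(1000, 1e-3, 1e-3)

# 1/tau and 1/q equal while n is larger: more weight on the first coordinate.
skewed = simplex_point(10**4, 1e-2, 1e-2)
```

Here `center` is approximately $(1/3, 1/3, 1/3)$, matching the caption's description of proportional scalings as the central point, and `skewed` is approximately $(0.5, 0.25, 0.25)$.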

Theorems & Definitions (36)

  • Theorem 1: Main theorem
  • proof
  • Proposition 2: Explicit-implicit penalty equivalence
  • proof
  • Proposition 3: Dropout-RaM equivalence
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • ...and 26 more