Phase Diagram of Dropout for Two-Layer Neural Networks in the Mean-Field Regime

Lénaïc Chizat; Pierre Marion; Yerkin Yesbay

TL;DR

This work analyzes dropout in two-layer neural networks under mean-field scaling to characterize how the training dynamics behave as the width grows. By decomposing dropout into propagation noise, random-metric updates, and a bias dropout penalty, the authors classify the limiting behavior into five nondegenerate regimes (among them a discrete jump process, a Wasserstein gradient flow with penalization, a continuous-time jump process, and a critical regime) and provide explicit limit equations for each. The results show that the classic dropout penalty survives in the limit only at impractically small learning rates, of order $O(1/\text{width})$, while at larger rates dropout acts like a random metric or a penalized gradient flow, with a time-averaged NTK in some cases. The analysis combines mean-field particle techniques, stochastic approximation theory, and coupling methods to establish convergence in path space and in distribution space, laying groundwork for a deeper theoretical understanding of dropout in large-scale networks. The findings bear on how training dynamics in wide networks are modeled and on the design of practical alternatives that emulate dropout's effects without requiring small learning rates.

Abstract

Dropout is a standard training technique for neural networks that consists of randomly deactivating units at each step of their gradient-based training. It is known to improve performance in many settings, including in the large-scale training of language or vision models. As a first step towards understanding the role of dropout in large neural networks, we study the large-width asymptotics of gradient descent with dropout on two-layer neural networks with the mean-field initialization scale. We obtain a rich asymptotic phase diagram that exhibits five distinct nondegenerate phases depending on the relative magnitudes of the dropout rate, the learning rate, and the width. Notably, we find that the well-studied "penalty" effect of dropout only persists in the limit with impractically small learning rates of order $O(1/\text{width})$. For larger learning rates, this effect disappears and in the limit, dropout is equivalent to a "random geometry" technique, where the gradients are thinned randomly after the forward and backward passes have been computed. In this asymptotic regime, the limit is described by a mean-field jump process where the neurons' update times follow independent Poisson or Bernoulli clocks (depending on whether the learning rate vanishes or not). For some of the phases, we obtain a description of the limit dynamics both in path-space and in distribution-space. The convergence proofs involve a mix of tools from mean-field particle systems and stochastic processes. Together, our results lay the groundwork for a renewed theoretical understanding of dropout in large-scale neural networks.
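
The contrast the abstract draws between standard dropout and the "random geometry" technique can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes a tanh activation, a mean squared loss, inverted-dropout rescaling by the keep probability `q`, and a per-neuron step size scaled by the width `m`. All function names are hypothetical.

```python
import numpy as np

def forward(a, W, X, mask):
    """Mean-field two-layer net: f(x) = (1/m) * sum_j mask_j * a_j * tanh(w_j . x)."""
    m = a.shape[0]
    return (np.tanh(X @ W.T) * (mask * a)).sum(axis=1) / m

def grads(a, W, X, y, mask):
    """Gradients of L = (1/2n) * sum_i (f(x_i) - y_i)^2 with respect to a and W."""
    n, m = X.shape[0], a.shape[0]
    H = np.tanh(X @ W.T)                      # hidden activations, shape (n, m)
    r = forward(a, W, X, mask) - y            # residuals, shape (n,)
    ga = mask * (H.T @ r) / (n * m)           # shape (m,)
    gW = ((r[:, None] * (1 - H**2)) * (mask * a)).T @ X / (n * m)  # shape (m, d)
    return ga, gW

def dropout_step(a, W, X, y, lr, q, rng):
    """Standard dropout: the mask enters both the forward and the backward pass."""
    m = a.shape[0]
    mask = rng.binomial(1, q, size=m) / q     # assumed inverted-dropout rescaling
    ga, gW = grads(a, W, X, y, mask)
    return a - lr * m * ga, W - lr * m * gW   # assumed mean-field step scaling (lr * m)

def random_geometry_step(a, W, X, y, lr, q, rng):
    """'Random geometry': full forward/backward pass, then thin the gradients."""
    m = a.shape[0]
    ga, gW = grads(a, W, X, y, np.ones(m))    # no mask in either pass
    mask = rng.binomial(1, q, size=m) / q
    return a - lr * m * mask * ga, W - lr * m * mask[:, None] * gW

# Toy data (illustrative only).
rng = np.random.default_rng(0)
n, d, m = 32, 3, 200
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])
a = rng.standard_normal(m)
W = rng.standard_normal((m, d))

# Same seed for both steps, so both draw the same mask.
a_do, W_do = dropout_step(a, W, X, y, lr=0.1, q=0.5, rng=np.random.default_rng(1))
a_rg, W_rg = random_geometry_step(a, W, X, y, lr=0.1, q=0.5, rng=np.random.default_rng(1))
```

With a shared seed, both steps leave the same neurons untouched. They differ only in the residual: the dropout step computes it from the masked network, whereas the random-geometry step computes it from the full network.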
