Implicit Regularization of Large Neural Networks via Mean-Field Formulation

Beatrice Acciaio, Jakob Heiss, Gudmund Pammer, Qinxin Yan

Abstract

We propose a mathematical framework to explain implicit regularization from early stopping during the training of overparametrized neural networks. In the mean-field limit, the parameter distribution evolves according to a gradient flow on the space of probability measures. We show that these dynamics admit an equivalent McKean-Vlasov stochastic control formulation through the corresponding Hamilton-Jacobi-Bellman (HJB) equation. The control viewpoint yields a Dynamic Programming Principle (DPP), which we use to define a new metric on probability measures. This metric can be viewed as a mean-field generalization of the control representation of the Wasserstein-2 distance, and it naturally appears as a regularization term selected by early stopping. We further obtain non-asymptotic bounds describing how the induced regularization depends on the stopping time.

Paper Structure

This paper contains 35 sections, 26 theorems, 216 equations, 2 figures.

Key Result

Proposition 2.1

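For all $\boldsymbol{\mu}\in AC^2(I)$, the limit defining the metric derivative, $|\mu'|(t) := \lim_{s\to t}\frac{{\mathcal{W}}_2(\mu_s,\mu_t)}{|s-t|}$, exists for a.e. $t\in I$. Moreover, the map $t\mapsto|\mu'|(t)$ belongs to $L^2(I)$.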

Figures (2)

  • Figure 1: Deterministic diagnostics for the mean-field finite-width training dynamics. Left: energy--dissipation parity, comparing the loss drop $L(\mu_0^N)-L(\mu_t^N)$ with the cumulative mean-field-scaled kinetic action. Middle: second-moment increment $m_2(\mu_t^N)-m_2(\mu_0^N)$ plotted against $\sqrt{t}\sqrt{L(\mu_0^N)-L(\mu_t^N)}$. Right: analogous increments for the Barron-type surrogates $B_1$ and $B_{1,1}$. The three panels indicate that loss dissipation is accompanied by controlled growth of moments and variation-type quantities.
  • Figure 2: Finite-width identity-coupling proxy for Theorem \ref{thm: earlystoppingbound}. For each small stopping time $T$, we compare two displacements from initialization, both measured with the identity-coupling proxy: that of the gradient-flow endpoint $\mu_T^{N,\mathrm{GF}}$ and that of an approximate minimizer of the proxy functional $\widetilde{\Phi}_T^{N,\mathrm{id}}(\nu) = L_N(\nu)+\frac{1}{2T}\widetilde{\mathcal{W}}_{2,\mathrm{id}}^2(\mu_0^N,\nu).$ The two curves remain of the same order for small $T$, consistent with the short-time endpoint principle.
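
The comparison described in Figure 2 can be illustrated on a toy problem. The following is a minimal sketch and not the paper's experiment. It assumes a non-interacting quadratic loss $L_N(\theta)=\frac{1}{N}\sum_i \frac{1}{2}(\theta_i-\theta^\ast)^\top A(\theta_i-\theta^\ast)$ with diagonal $A$, and it assumes the identity-coupling proxy is $\widetilde{\mathcal{W}}_{2,\mathrm{id}}^2(\mu_0^N,\nu)=\frac{1}{N}\sum_i|\theta_i-\theta_i^0|^2$. Under these assumptions both the mean-field-scaled gradient-flow endpoint and the minimizer of the proxy functional have closed forms. All names (`gf_endpoint`, `prox_minimizer`, `id_displacement`, `lams`, `target`) are hypothetical.

```python
import numpy as np

def gf_endpoint(theta0, lams, target, T):
    # Exact solution at time T of theta_i' = -A (theta_i - target), A = diag(lams).
    # This is the mean-field-scaled gradient flow of the toy quadratic loss.
    return target + np.exp(-T * lams) * (theta0 - target)

def prox_minimizer(theta0, lams, target, T):
    # Exact minimizer of L_N(theta) + (1/(2T)) * (1/N) * sum_i |theta_i - theta0_i|^2.
    # First-order condition per coordinate: lam*(theta - target) + (theta - theta0)/T = 0.
    return (theta0 + T * lams * target) / (1.0 + T * lams)

def id_displacement(theta0, theta):
    # Identity-coupling displacement (assumed form): sqrt((1/N) sum_i |theta_i - theta0_i|^2).
    return np.sqrt(np.mean(np.sum((theta - theta0) ** 2, axis=1)))

rng = np.random.default_rng(0)
N, d = 256, 3
theta0 = rng.normal(size=(N, d))      # initial particles, i.e. samples of mu_0^N
lams = np.array([0.5, 1.0, 2.0])      # eigenvalues of A (toy choice)
target = np.array([1.0, -1.0, 0.5])   # minimizer theta* of the toy loss

stopping_times = [0.01, 0.05, 0.1, 0.5]
ratios = []
for T in stopping_times:
    d_gf = id_displacement(theta0, gf_endpoint(theta0, lams, target, T))
    d_prox = id_displacement(theta0, prox_minimizer(theta0, lams, target, T))
    ratios.append(d_gf / d_prox)
    print(f"T={T:5.2f}  GF displacement={d_gf:.4f}  proxy-minimizer displacement={d_prox:.4f}")
```

In this toy setting the two displacements differ per eigen-direction by the factor $(1-e^{-x})(1+x)/x$ with $x=T\lambda$, which tends to $1$ as $T\to 0$ and stays bounded, so the two quantities are of the same order. This only mirrors the qualitative statement in the caption; the interacting neural-network loss of the paper is not modeled here.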

Theorems & Definitions (72)

  • Definition 1.1
  • Proposition 2.1: Metric derivative
  • Proposition 2.2
  • Definition 2.3
  • Definition 2.4
  • Remark 2.5
  • Definition 2.6: Metric slope on $({\mathcal{P}}_2(E),{\mathcal{W}}_2)$
  • Definition 2.7: Strong Wasserstein subdifferential on ${\mathcal{P}}_2(E)$
  • Definition 2.8: Slope-realizing subgradients
  • Remark 2.9: Slope and subdifferential
  • ...and 62 more