Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

Dario Bocchi, Theotime Regimbeau, Carlo Lucibello, Luca Saglietti, Chiara Cammarota

Abstract

We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher--student framework. In the high-dimensional regime, where the input dimension $N$ and the number of samples $M$ diverge at fixed ratio $\alpha = M/N$, and for finite hidden widths $(p,p^*)$ of the student and teacher, respectively, we study the low-dimensional ordinary differential equations that govern the evolution of the student--teacher and student--student overlap matrices. We show that overparameterization ($p>p^*$) only modestly accelerates escape from a plateau of poor generalization by modifying the prefactor of the exponential decay of the loss. We then examine how unconstrained weight norms introduce a continuous rotational symmetry that results in a nontrivial manifold of zero-loss solutions for $p>1$. From this manifold the dynamics consistently selects the solution closest to the random initialization, as enforced by a conserved quantity in the ODEs governing the evolution of the overlaps. Finally, a Hessian analysis of the population-loss landscape confirms that the plateau and the solution manifold correspond, respectively, to saddles with at least one negative eigenvalue and to marginal minima.
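
The setup can be made concrete with a short simulation. The Python sketch below implements one-pass SGD for a quadratic two-layer network in the teacher--student setting and extracts the overlap matrices whose ODEs the paper studies. The sum-of-squared-preactivations output, the squared loss, and all parameter values are assumptions made for illustration; this excerpt does not specify the paper's exact conventions (e.g. second-layer weights or loss normalization).

```python
# Illustrative sketch (not the paper's exact setup) of one-pass SGD for a
# two-layer quadratic network. Assumed conventions: student output
# f(x) = sum_k (w_k . x / sqrt(N))^2, teacher of the same form with p* units,
# per-sample squared loss, and one fresh Gaussian sample per step (one-pass).
import numpy as np

rng = np.random.default_rng(0)
N, p, p_star = 1000, 6, 3        # input dimension, student and teacher widths
eta, steps = 0.01, 200_000       # learning rate and number of one-pass steps

W = rng.standard_normal((p, N))            # student weights, random init
W_star = rng.standard_normal((p_star, N))  # fixed teacher weights

for t in range(steps):
    x = rng.standard_normal(N)                     # fresh sample: one-pass SGD
    pre = W @ x / np.sqrt(N)                       # student preactivations
    pre_star = W_star @ x / np.sqrt(N)             # teacher preactivations
    err = np.sum(pre**2) - np.sum(pre_star**2)     # residual on this sample
    # gradient of (1/2) err^2 w.r.t. w_k is err * 2 * pre_k * x / sqrt(N)
    W -= eta * err * 2.0 * np.outer(pre, x) / np.sqrt(N)

# Order parameters tracked by the low-dimensional ODEs: overlap matrices
Q = W @ W.T / N            # student-student overlaps (p x p)
R = W @ W_star.T / N       # student-teacher overlaps (p x p*)
T = W_star @ W_star.T / N  # teacher-teacher overlaps (p* x p*)
```

In this scaling, $\alpha = \text{steps}/N$ plays the role of the continuous time variable of the ODEs, so averaging several such runs at large $N$ is what a comparison like Figure 1 would require.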

Figures (6)

  • Figure 1: Comparison between the numerical solution of the ODEs (solid lines) and the average over 10 simulated dynamics (dashed lines), all initialized with the same random overlap configuration, for $p = 6$, $p^{*} = 3$, $\eta = 0.01$, and $N = 10^4$. The shaded region indicates the standard error of the mean for the simulations. Top left: teacher-student overlaps. Top right: student-student overlaps. Bottom left: student norms. Bottom right: generalization errors.
  • Figure 2: Numerical analysis of the optimal learning rate $\eta$ for student-teacher training dynamics. Left: Value of $\alpha_{c}$ at which the loss is reduced to half of its initial value, shown as a function of $\eta$ for fixed $p = p^* = 3, \,\epsilon=10^{-2}$. Each point of the curve corresponds to the average over 20 different random initializations. Right: The optimal learning rate as a function of student width $p$, revealing a linear scaling trend.
  • Figure 3: Analysis using the numerical solution of the ODEs for $p = 6$, $p^{*} = 3$, $\eta = 0.01$, $\epsilon=0.01$. Left: Euclidean distance between the evolving order parameter $\bm{\rho}(t)$ and the predicted zero-error solution $\overline{\bm{\rho}}$, shown in black for the trajectory initialized with $\bm{\rho}_0$ (used to compute $\overline{\bm{\rho}}$), and in gray for several trajectories initialized with uncorrelated random matrices. Right: Evolution of the corresponding generalization error for all initializations.
  • Figure 4: Evolution of the elements of the matrix $S(t)$ over time for a random orthogonal initialization, illustrating their numerical conservation during the learning dynamics.
  • Figure 5: Degrees of freedom $\mathrm{DOF}$ of the student network as a function of the hidden-layer width $p$.
  • ...and 1 more figure
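
The generalization-error curves in Figures 1 and 3 depend on the weights only through the overlaps. Under the same (assumed) sum-of-squares convention as in the sketch above, Wick's theorem for the jointly Gaussian preactivation fields yields a closed-form population error, which the snippet below checks against a Monte Carlo estimate; the $1/2$ normalization is likewise an assumption of this sketch.

```python
# Hedged sketch: population error as a function of the overlaps (Q, R, T),
# assuming sum-of-squared-preactivation outputs and standard Gaussian inputs.
# For the Gaussian fields lambda_k = w_k . x / sqrt(N), nu_m = w*_m . x / sqrt(N),
# Wick's theorem gives
#   eps_g = (1/2) E[(f - y)^2]
#         = (1/2)(Tr Q - Tr T)^2 + ||Q||_F^2 + ||T||_F^2 - 2 ||R||_F^2.
import numpy as np

def population_error(Q, R, T):
    """Closed-form population error in terms of the overlap matrices."""
    return (0.5 * (np.trace(Q) - np.trace(T)) ** 2
            + np.sum(Q ** 2) + np.sum(T ** 2) - 2.0 * np.sum(R ** 2))

# Monte Carlo sanity check on random weights
rng = np.random.default_rng(1)
N, p, p_star = 300, 6, 3
W = rng.standard_normal((p, N))
W_star = rng.standard_normal((p_star, N))
Q, R, T = W @ W.T / N, W @ W_star.T / N, W_star @ W_star.T / N

X = rng.standard_normal((20_000, N))
f = np.sum((X @ W.T) ** 2, axis=1) / N       # student outputs per sample
y = np.sum((X @ W_star.T) ** 2, axis=1) / N  # teacher outputs per sample
print(population_error(Q, R, T))      # closed form
print(0.5 * np.mean((f - y) ** 2))    # Monte Carlo; agrees up to sampling noise
```

This closed form makes explicit why low-dimensional ODEs for the overlap matrices suffice to track the loss: no function of the weights other than $(Q, R, T)$ enters the population error.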