Implicit Bias of Mirror Flow for Shallow Neural Networks in Univariate Regression

Shuang Liang; Guido Montúfar

TL;DR

This work analyzes the implicit bias of mirror flow for wide, shallow neural networks in univariate least squares regression. For a broad class of unscaled potentials, mirror flow exhibits lazy training and, as the width tends to infinity, has the same implicit bias as ordinary gradient flow; for ReLU networks, this bias is characterized by a variational problem in function space. With scaled potentials, mirror flow still trains lazily but is no longer in the kernel regime, and for absolute value activations it induces biases that generally cannot be expressed as RKHS norms. The parameter initialization determines how strongly the curvature of the learned function is penalized at different locations of the input space, while the scaled potential determines how different curvature magnitudes are penalized, so the geometry of the potential acts as a handle on regularization. The analysis combines a linearization argument, a minimal representation cost formulation, and a translation of the implicit bias from parameter space to function space.

Abstract

We examine the implicit bias of mirror flow in univariate least squares error regression with wide and shallow neural networks. For a broad class of potential functions, we show that mirror flow exhibits lazy training and has the same implicit bias as ordinary gradient flow when the network width tends to infinity. For ReLU networks, we characterize this bias through a variational problem in function space. Our analysis includes prior results for ordinary gradient flow as a special case and lifts limitations which required either an intractable adjustment of the training data or networks with skip connections. We further introduce scaled potentials and show that for these, mirror flow still exhibits lazy training but is not in the kernel regime. For networks with absolute value activations, we show that mirror flow with scaled potentials induces a rich class of biases, which generally cannot be captured by an RKHS norm. A takeaway is that whereas the parameter initialization determines how strongly the curvature of the learned function is penalized at different locations of the input space, the scaled potential determines how the different magnitudes of the curvature are penalized.
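
For orientation, mirror flow with a strictly convex, twice-differentiable potential $\Phi$ is commonly written as below. This is the standard textbook form and is not copied from the paper: the loss symbol $L$ and the normalization are assumptions made here for illustration, and the paper's exact definitions of unscaled and scaled potentials are not reproduced on this page.

$$
\frac{d}{dt}\,\nabla\Phi\big(\theta(t)\big) = -\nabla L\big(\theta(t)\big)
\quad\Longleftrightarrow\quad
\frac{d}{dt}\,\theta(t) = -\big(\nabla^2\Phi(\theta(t))\big)^{-1}\,\nabla L\big(\theta(t)\big).
$$

With $\Phi(\theta)=\tfrac{1}{2}\|\theta\|_2^2$ the Hessian is the identity and the dynamics reduce to ordinary gradient flow, which is the sense in which gradient flow is a special case of mirror flow.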

Paper Structure (47 sections, 30 theorems, 297 equations, 9 figures)

Introduction
Contributions
Related works
Implicit bias of mirror descent
Implicit bias of gradient descent for overparametrized networks
Representation cost
Problem Setup
Main Results
Implicit bias of mirror flow with unscaled potentials
Implicit bias of mirror flow with scaled potentials
General framework for the proof of the main results
Linearization
Implicit bias in parameter space
Function space description of the implicit bias
Experiments
...and 32 more sections

Key Result

Theorem 2

Consider a two-layer ReLU network with $d\geq 1$ input units and $n$ hidden units, where $n$ is assumed to be sufficiently large. Consider parameter initialization init and, specifically, let $p_{\mathcal{B}}(b)$ denote the density function of the random variable $\mathcal{B}$. Consider any finite tr[…] Moreover, letting $\theta(\infty)=\lim_{t\rightarrow \infty} \theta(t)$, we have for any given $x\i[…]

Figures (9)

Figure 1: Illustration of Theorem 2. Left: ReLU networks with $4860$ hidden neurons, uniformly initialized input biases and zero initialized output weights and biases, trained with mirror flow on a common data set using unscaled potentials: $\phi_1= x^2$, $\phi_2 = |x|^3 + x^2$, and $\phi_3 = x^4 + x^2$. Middle: $L^\infty$-error between the solution to the variational problem and the networks trained using mirror descent with $\phi_1$, $\phi_2$ and $\phi_3$, against the network width. Right: ReLU networks trained with gradient descent on five different data sets, each obtained by translating the same data set along the $y$-axis.
Figure 2: Illustration of Theorem 9. Left: absolute value networks with $4860$ hidden neurons, uniformly initialized input biases and zero initialized output weights and biases, trained with mirror descent on a common data set using scaled potentials: $\phi_1= x^2$, $\phi_2 = |x|^3 + x^2$, and $\phi_3= x^4 + x^2$. For comparison we also plot a network with Gaussian initialized input biases, trained with gradient descent. Middle: $L^\infty$-error between the solution to the variational problems and networks trained using mirror descent for scaled potentials $\phi_1$, $\phi_2$ and $\phi_3$, against the network width. Right: the distribution of the magnitude of the second derivative of the solutions to the variational problems for $\phi_1$ (blue), $\phi_2$ (orange), and $\phi_3$ (green). The inset shows the second derivatives over the input domain.
Figure 3: Left: 2D PCA representation of parameter trajectories under mirror descent with unscaled potentials $\phi_1= x^2$, $\phi_2 = |x|^3 + x^2$, and $\phi_3= x^4 + x^2$, for ReLU networks with $4860$ hidden units. Right: 2D PCA representation of parameter trajectories under mirror descent with scaled potentials $\phi_1$, $\phi_2$, and $\phi_3$, for networks with absolute value activations and $4860$ hidden units.
Figure 4: Left: $\ell_\infty$ norm of the difference between the final parameter and initial parameter of ReLU networks trained by mirror descent with different unscaled potentials: $\phi_1= x^2$, $\phi_2= |x|^3 + x^2$, and $\phi_3= x^4 + x^2$, plotted against the number of hidden units. Right: spectral norm ($2$-norm) of the difference between the final kernel matrix and the initial kernel matrix from the same training instances as the left panel, plotted against the number of hidden units.
Figure 5: Left panel: $\ell_\infty$ norm of the difference between the final parameter and initial parameter of absolute value networks trained by mirror descent with different scaled potentials: $\phi_1= x^2$, $\phi_2= |x|^3 + x^2$, and $\phi_3= x^4 + x^2$, plotted against the number of hidden units. Right panel: spectral norm ($2$-norm) of the difference between the final kernel matrix and initial kernel matrix from the same training instances as the left panel, plotted against the number of hidden units. The inset shows the same data with the y-axis in linear scale instead of log-scale.
...and 4 more figures
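
The potentials named in the captions above can be tried out on a toy problem. The following is a minimal sketch, not the paper's experimental code: it runs discrete mirror descent on an overparametrized linear least squares problem rather than on a neural network, and it assumes a separable potential $\Phi(\theta)=\sum_k \phi(\theta_k-\theta_k(0))$ centered at the initialization. The function names (`grad_phi`, `grad_phi_inv`, `mirror_descent`), the step size, and the problem sizes are hypothetical choices for illustration. Only $\phi_1$ and $\phi_2$ are included because their derivatives have closed-form inverses.

```python
import numpy as np

def grad_phi(x, kind):
    """Elementwise derivative of the scalar potential phi."""
    if kind == "phi1":   # phi(x) = x^2
        return 2.0 * x
    if kind == "phi2":   # phi(x) = |x|^3 + x^2
        return 3.0 * x * np.abs(x) + 2.0 * x
    raise ValueError(f"unknown potential: {kind}")

def grad_phi_inv(y, kind):
    """Elementwise inverse of grad_phi (closed form for these two potentials)."""
    if kind == "phi1":
        return y / 2.0
    if kind == "phi2":
        # Solve 3u^2 + 2u = |y| for u >= 0, then restore the sign of y.
        u = (-2.0 + np.sqrt(4.0 + 12.0 * np.abs(y))) / 6.0
        return np.sign(y) * u
    raise ValueError(f"unknown potential: {kind}")

def mirror_descent(A, y, theta0, kind, lr=0.01, steps=5000):
    """Mirror descent on 0.5 * ||A theta - y||^2.

    Assumed potential: Phi(theta) = sum_k phi(theta_k - theta0_k).
    Each step moves in the dual (mirror) coordinates and maps back.
    """
    theta = theta0.copy()
    for _ in range(steps):
        grad = A.T @ (A @ theta - y)                       # loss gradient
        dual = grad_phi(theta - theta0, kind) - lr * grad  # step in dual space
        theta = theta0 + grad_phi_inv(dual, kind)          # map back to primal
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 50))   # 5 data points, 50 parameters
    y = rng.normal(size=5)
    theta0 = np.zeros(50)
    for kind in ("phi1", "phi2"):
        theta = mirror_descent(A, y, theta0, kind)
        print(kind, "residual:", np.linalg.norm(A @ theta - y),
              "l2 norm:", np.linalg.norm(theta))
```

With `phi1` each update reduces to a plain gradient step of size `lr / 2`, so starting from zero the iterates approach the minimum $\ell_2$-norm interpolant. With `phi2` the iterates still interpolate the data but may settle on a different interpolant, which is the kind of potential-dependent implicit bias the paper studies in the network setting.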

Theorems & Definitions (53)

Theorem 2: Implicit bias of mirror flow for wide ReLU network
Remark 3: Relaxed potential assumption
Remark 4: Gradient flow with reparametrization
Remark 5: Absolute value activations
Remark 6: Skip connections
Remark 7: Natural gradient descent
Theorem 9: Implicit bias of scaled mirror flow for wide absolute value network
Remark 10: Scaled natural gradient descent
Theorem 11: Gunasekar et al. (2018)
Proposition 12
...and 43 more
