The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity

Justin Sahs, Ryan Pyle, Fabio Anselmi, Ankit Patel

TL;DR

The paper tackles why overparameterized shallow networks generalize by examining the implicit bias imposed by activation non-linearities. It introduces a Radon Spline (RS) reparameterization that exposes a kernel-regime spectral penalty via a Radon–Fourier view and extends the analysis to an adaptive regime where zero-plane clusters emerge during training. Key contributions include the RS parameterization, a kernel-regime Radon seminorm with a Fourier interpretation, a method to design activations to realize desired spectral penalties, and a dynamic RS/Gaussian-like picture of adaptive learning with datapoint pinning validated by simulations and MNIST experiments. The work provides a mechanistic lens on spectral bias and generalization, offering principled guidance for activation design and training strategies in overparameterized nets.

Abstract

Despite classical statistical theory predicting severe overfitting, modern massively overparameterized neural networks still generalize well. This unexpected property is attributed to the network's so-called implicit bias, which describes its propensity to converge to solutions that generalize effectively, among the many possible solutions that correctly label the training data. The aim of our research is to explore this bias from a new perspective, focusing on how non-linear activation functions contribute to shaping it. First, we introduce a reparameterization which removes a continuous weight rescaling symmetry. Second, in the kernel regime, we leverage this reparameterization to generalize recent findings that relate shallow neural networks to the Radon transform, deriving an explicit formula for the implicit bias induced by a broad class of activation functions. Specifically, by utilizing the connection between the Radon transform and the Fourier transform, we interpret the kernel regime's inductive bias as minimizing a spectral seminorm that penalizes high-frequency components, in a manner dependent on the activation function. Finally, in the adaptive regime, we demonstrate the existence of local dynamical attractors that facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero, yielding alignment between many neurons' response functions. We confirm these theoretical results with simulations. Altogether, our work provides a deeper understanding of the mechanisms underlying the generalization capabilities of overparameterized neural networks and their relation to the implicit bias, offering potential pathways for designing more efficient and robust models.
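The "continuous weight rescaling symmetry" mentioned in the abstract can be made concrete with a small numerical check. The sketch below is only an illustration under stated assumptions: it uses a ReLU activation (chosen here because ReLU is positively homogeneous, not because the paper restricts itself to it), and the names `shallow_relu` and `rescale` are hypothetical helpers, not the paper's RS parameterization. It shows that scaling a neuron's input weights and bias by a positive factor while dividing its output weight by the same factor leaves the network function unchanged.

```python
import numpy as np


def shallow_relu(x, w, b, v):
    """Shallow network f(x) = sum_i v_i * relu(w_i . x + b_i).

    x: (N, D) inputs, w: (H, D) input weights, b: (H,) biases, v: (H,) output weights.
    """
    pre = x @ w.T + b                 # pre-activations, shape (N, H)
    return np.maximum(pre, 0.0) @ v   # network outputs, shape (N,)


def rescale(w, b, v, alpha):
    """Per-neuron map (w_i, b_i, v_i) -> (alpha_i * w_i, alpha_i * b_i, v_i / alpha_i).

    Requires alpha_i > 0 so that relu(alpha_i * z) == alpha_i * relu(z).
    """
    alpha = np.asarray(alpha, dtype=float)
    if np.any(alpha <= 0):
        raise ValueError("alpha must be strictly positive")
    return w * alpha[:, None], b * alpha, v / alpha


# Assumed toy sizes: 5 inputs in 3 dimensions, 4 hidden neurons.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
w = rng.normal(size=(4, 3))
b = rng.normal(size=4)
v = rng.normal(size=4)

alpha = rng.uniform(0.1, 10.0, size=4)
w2, b2, v2 = rescale(w, b, v, alpha)

# Largest difference between the two networks' outputs on the sample inputs.
max_gap = float(np.max(np.abs(shallow_relu(x, w, b, v) - shallow_relu(x, w2, b2, v2))))
print(f"max output difference after rescaling: {max_gap:.2e}")
```

Under this rescaling the set where a neuron's pre-activation is zero, $\{x : w_i \cdot x + b_i = 0\}$, is also unchanged, so many distinct weight settings describe the same function and the same zero hyperplanes. A reparameterization that removes this redundancy, as the abstract describes, works with one representative per equivalence class.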

Paper Structure

This paper contains 29 sections, 4 theorems, 33 equations, 10 figures, 3 tables.

Key Result

Lemma 1

Figures (10)

  • Figure 1: Activation functions $\phi(z)$ and their spectral penalty factors $\rho_\phi(k)$. For Sinc and Squared Sinc, $\rho_\phi(k)$ is infinite outside the interval $(-a,a)$, as indicated by the shaded region.
  • Figure 2: A bump function $f(x)$ and its contraction, $f_\varepsilon(x)$ for $\varepsilon=0.1$.
  • Figure 3: Local features of the loss surface slice $\widetilde{\ell}(\widetilde{\boldsymbol{\upxi}}_i|\ldots)$. Top: a valley; Middle: a ridge; Bottom: a pass-through crease. Left: heatmap of the loss. Right: 1-dimensional slices along numbered lines.
  • Figure 4: Datapoint pinning: the region near the intersection of two datapoint ellipses $\mathcal{E}_n$ and $\mathcal{E}_m$ where both boundaries are valley floors. Top Left: heatmap of the loss. Top Right: 1-dimensional slices along numbered lines. Bottom Left: the parameter-space trajectory of a breakplane following gradient descent on $\widetilde{\ell}(\widetilde{\boldsymbol{\upxi}}_i|\ldots)$. From its starting point, the breakplane follows a nearly vertical trajectory (i.e. almost all change is in $\gamma_i$) until it meets a valley floor, after which it remains confined to that valley floor and is pinned by the corresponding datapoint. It then continues along the valley floor until it reaches the intersection point, which is a local minimum. Bottom Right: the trajectory of the breakplane in data space, showing that the breakplane first moves towards the bottom datapoint, then is constrained to rotate around that datapoint until it becomes pinned by the top datapoint as well.
  • Figure 5: Cluster formation: Left: $v_t(\boldsymbol{\upxi},\gamma)$ on a 3-datapoint 2D example. Center: alternative visualization with maxima and minima corresponding to the sources and sinks of $v_t(\boldsymbol{\upxi},\gamma)$. Right: the breaklines, colored by delta-slope, and the 3 datapoints. Each row shows snapshots from different times throughout training.
  • ...and 5 more figures

Theorems & Definitions (8)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof