The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity
Justin Sahs, Ryan Pyle, Fabio Anselmi, Ankit Patel
TL;DR
The paper asks why overparameterized shallow networks generalize, and answers by examining the implicit bias imposed by activation non-linearities. It introduces a Radon Spline (RS) reparameterization that exposes a kernel-regime spectral penalty via a Radon–Fourier view, and extends the analysis to an adaptive regime where zero-plane clusters emerge during training. Key contributions include the RS parameterization, a kernel-regime Radon seminorm with a Fourier interpretation, a method to design activations that realize desired spectral penalties, and a dynamic RS/Gaussian-like picture of adaptive learning with datapoint pinning, validated by simulations and MNIST experiments. The work provides a mechanistic lens on spectral bias and generalization, offering principled guidance for activation design and training strategies in overparameterized networks.
Abstract
Despite classical statistical theory predicting severe overfitting, modern massively overparameterized neural networks still generalize well. This unexpected property is attributed to the network's so-called implicit bias, which describes its propensity to converge to solutions that generalize effectively, among the many possible solutions that correctly label the training data. The aim of our research is to explore this bias from a new perspective, focusing on how non-linear activation functions contribute to shaping it. First, we introduce a reparameterization which removes a continuous weight rescaling symmetry. Second, in the kernel regime, we leverage this reparameterization to generalize recent findings that relate shallow neural networks to the Radon transform, deriving an explicit formula for the implicit bias induced by a broad class of activation functions. Specifically, by utilizing the connection between the Radon transform and the Fourier transform, we interpret the kernel regime's inductive bias as minimizing a spectral seminorm that penalizes high-frequency components, in a manner dependent on the activation function. Finally, in the adaptive regime, we demonstrate the existence of local dynamical attractors that facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero, yielding alignment between many neurons' response functions. We confirm these theoretical results with simulations. Altogether, our work provides a deeper understanding of the mechanisms underlying the generalization capabilities of overparameterized neural networks and their relation to the implicit bias, offering potential pathways for designing more efficient and robust models.
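The continuous weight rescaling symmetry mentioned above can be illustrated with a small numerical check. The sketch below is not the paper's reparameterization. It assumes a ReLU activation (chosen here only because it is positively homogeneous) and uses hypothetical names (`shallow_relu`, `zero_plane`, `alpha`) to show that scaling a neuron's input weights and bias by a positive factor, while dividing its output weight by the same factor, leaves both the network function and each neuron's zero hyperplane unchanged.

```python
import numpy as np


def shallow_relu(x, w, b, v):
    """Shallow network f(x) = sum_i v_i * relu(w_i . x + b_i).

    x: (n, d) inputs, w: (h, d) input weights, b: (h,) biases, v: (h,) output weights.
    ReLU is an assumption made for this illustration.
    """
    pre = x @ w.T + b                  # pre-activations, shape (n, h)
    return np.maximum(pre, 0.0) @ v    # network outputs, shape (n,)


def zero_plane(w, b):
    """Unit normal and signed offset of each hyperplane {x : w_i . x + b_i = 0}."""
    norm = np.linalg.norm(w, axis=1)
    return w / norm[:, None], -b / norm


rng = np.random.default_rng(0)
d, h, n = 3, 5, 8                      # input dim, hidden width, number of test points
w = rng.normal(size=(h, d))
b = rng.normal(size=h)
v = rng.normal(size=h)
x = rng.normal(size=(n, d))

# Hypothetical per-neuron positive rescaling factors.
alpha = rng.uniform(0.5, 2.0, size=h)
w2, b2, v2 = alpha[:, None] * w, alpha * b, v / alpha

# The rescaled parameters define the same function on the test points.
same_function = np.allclose(shallow_relu(x, w, b, v), shallow_relu(x, w2, b2, v2))

# The hyperplanes where each neuron's input is zero are also unchanged.
n1, o1 = zero_plane(w, b)
n2, o2 = zero_plane(w2, b2)
same_planes = np.allclose(n1, n2) and np.allclose(o1, o2)

print(same_function, same_planes)
```

Because many distinct parameter settings map to the same function, a parameterization expressed in quantities that are invariant under this rescaling (such as the normals and offsets computed by `zero_plane`) is one natural way to factor the redundancy out. The specific parameterization used in the paper may differ from this sketch.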
