Understanding the dynamics of the frequency bias in neural networks

Juan Molina, Mircea Petrache, Francisco Sahli Costabal, Matías Courdurier

TL;DR

This work addresses the spectral bias observed in neural network training by deriving a PDE that describes the frequency dynamics of a two-layer Fourier Features network in the Neural Tangent Kernel regime. The dynamics reduce to a damped heat equation in Fourier space, with the diffusion and damping controlled by the initialization distribution ρ_w and the weight variance σ_a^2; this gives a clear mechanism for how different frequencies are learned over time. Finite element method simulations of the PDE and direct neural-network training both confirm that appropriate choices of initialization can mitigate or eliminate frequency bias, and the results extend empirically to multilayer FF-NN settings. The findings provide a principled rule for designing initializations that achieve balanced frequency learning, with implications for neural field representations and NTK-based analyses, and they open avenues for optimizing learning dynamics under alternative training schemes.
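For orientation, a generic damped heat equation in the frequency variable $\boldsymbol{\xi}$ can be written as below. This is only the textbook form of that class of equations, not the paper's exact result: the coefficients $D$ and $c$ are placeholder symbols introduced here, standing in for the diffusion and damping terms that the summary says are controlled by ρ_w and σ_a^2.

```latex
% Generic damped heat equation (illustrative form only; D and c are
% placeholder coefficients, not the paper's notation).
\partial_t \hat{u}(\boldsymbol{\xi}, t)
  = \nabla_{\boldsymbol{\xi}} \cdot \bigl( D(\boldsymbol{\xi}) \, \nabla_{\boldsymbol{\xi}} \hat{u}(\boldsymbol{\xi}, t) \bigr)
  - c(\boldsymbol{\xi}) \, \hat{u}(\boldsymbol{\xi}, t),
\qquad D \ge 0, \quad c \ge 0 .
% With D = 0 each frequency decays independently:
% \hat{u}(\xi, t) = \hat{u}(\xi, 0) \, e^{-c(\xi) t}.
```

In the diffusion-free case, a frequency with larger $c(\boldsymbol{\xi})$ has its error component decay faster, which is the sense in which such an equation can encode a frequency-dependent learning speed.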

Abstract

Recent works have shown that traditional Neural Network (NN) architectures display a marked frequency bias in the learning process. Namely, the NN first learns the low-frequency features before learning the high-frequency ones. In this study, we rigorously develop a partial differential equation (PDE) that unravels the frequency dynamics of the error for a 2-layer NN in the Neural Tangent Kernel regime. Furthermore, using this insight, we explicitly demonstrate how an appropriate choice of distributions for the initialization weights can eliminate or control the frequency bias. We focus our study on the Fourier Features model, an NN where the first layer has sine and cosine activation functions, with frequencies sampled from a prescribed distribution. In this setup, we experimentally validate our theoretical results and compare the NN dynamics to the solution of the PDE using the finite element method. Finally, we empirically show that the same principle extends to multi-layer NNs.
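As a concrete picture of the Fourier Features model described above, here is a minimal NumPy sketch of a two-layer network whose first layer applies cosine and sine to random projections. It rests on several assumptions that are not taken from the paper: the Gaussian choice for the frequency distribution ρ_w, the 1/sqrt(m) output scaling, and the function names `init_ff_network` and `ff_forward` are all illustrative.

```python
import numpy as np


def init_ff_network(m, d, sigma_w=1.0, sigma_a=1.0, seed=0):
    """Sample parameters for a two-layer Fourier Features network.

    W holds m frequencies in R^d drawn from rho_w (assumed Gaussian here);
    a holds 2*m output weights with standard deviation sigma_a.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma_w, size=(m, d))
    a = rng.normal(0.0, sigma_a, size=(2 * m,))
    return W, a


def ff_forward(x, W, a):
    """Evaluate the network on inputs x of shape (n, d); returns shape (n,)."""
    z = x @ W.T                                             # (n, m) projections
    feats = np.concatenate([np.cos(z), np.sin(z)], axis=1)  # (n, 2m) features
    m = W.shape[0]
    return feats @ a / np.sqrt(m)                           # assumed NTK-style scaling


if __name__ == "__main__":
    W, a = init_ff_network(m=256, d=2, sigma_w=3.0)
    x = np.random.default_rng(1).uniform(-1.0, 1.0, size=(5, 2))
    print(ff_forward(x, W, a).shape)
```

Changing `sigma_w` (or swapping the Gaussian for another sampler) changes which frequencies the first layer represents at initialization, which is the design knob the abstract refers to.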

Paper Structure (21 sections, 6 theorems, 55 equations, 6 figures)


Key Result

Theorem 3.1

Under the assumptions (assu), and assuming that $g \in H^{1}$, the linear dynamics in image space (eq:dinamicalinealespacioimagen) can be expressed in frequency space (in the sense of distributions, i.e. in duality with an arbitrary Schwartz test function $\psi\in \mathcal{S}(\mathbb R^d)$) as: [equation not shown], where $\hat{u}_{\rho_{\text{data}}}=\mathcal{F}[u\rho_{\text{data}}]$.

Figures (6)

  • Figure 1: Two-layer neural network with the Fourier Features model.
  • Figure 2: Target function $\widetilde{f}$ used in our experiments and the magnitude of its Fourier transform.
  • Figure 3: Frequency learning rate $\kappa(\boldsymbol{\xi})$ for different initialization distributions $\rho_{\boldsymbol{w}}(\boldsymbol{\xi})$. Left panel: normal distribution with different standard deviations $\sigma_{\boldsymbol{w}}$. Right panel: uniform distribution with different widths $R$.
  • Figure 4: Comparison between FEM simulations of \ref{eq:dynamicsFF} and the actual NN dynamics.
  • Figure 5: Frequency learning rate $\kappa(\boldsymbol{\xi})$ for 3-layer and 4-layer neural networks in the NTK regime. All hidden layers have width 4000.
  • ...and 1 more figure

Theorems & Definitions (14)

  • Theorem 3.1: Proved as Thm. \ref{thm:evol_app}
  • Corollary 4.1: Proved as Cor. \ref{cor:PDE_app}
  • Remark 4.2
  • Definition A.1
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof
  • Theorem A.4: cf. Thm. \ref{thm:evol}
  • proof
  • ...and 4 more