Neural network initialization with nonlinear characteristics and information on spectral bias

Hikaru Homma, Jun Ohkubo

TL;DR

The paper addresses the impact of initialization on neural network training by integrating spectral-bias information into SWIM-based parameter initialization. It introduces a per-layer scheduling of the nonlinearity scale factors $s_{1,l}$ (with $s_{2,l}=\tfrac{1}{2}s_{1,l}$) to encode coarse information in early layers and fine details in later layers, yielding improved performance on both a 1D regression task and MNIST classification without gradient-based training. Empirical results show that the proposed ordered scheduling outperforms the original SWIM and reversed schemes when network width is large, highlighting the practical value of leveraging intrinsic spectral properties. The work suggests future extensions to other architectures and hyperparameter optimization to further harness spectral-bias effects in data-driven initializations.

Abstract

Initialization of neural network parameters, such as weights and biases, has a crucial impact on learning performance; if chosen well, we can even avoid the need for additional training with backpropagation. For example, algorithms based on the ridgelet transform or the SWIM (sampling where it matters) concept have been proposed for initialization. It is also well known that neural networks tend to learn coarse information in the earlier layers. This tendency is called spectral bias. In this work, we investigate the effects of utilizing information on the spectral bias in the initialization of neural networks. To this end, we propose a framework that adjusts the scale factors in the SWIM algorithm to capture low-frequency components in the early-stage hidden layers and to represent high-frequency components in the late-stage hidden layers. Numerical experiments on a one-dimensional regression task and the MNIST classification task demonstrate that the proposed method outperforms the conventional initialization algorithms. This work clarifies the importance of intrinsic spectral properties in learning neural networks, and the finding yields an effective parameter initialization strategy that enhances their training performance.
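
As a rough illustration of the layer-wise scale factors described above, the following Python sketch builds the three schedules named in the Figure 4 caption ("ordered", "reversed", "normal"). The function name `scale_schedule`, the linear spacing, the end points, and the midpoint used for the constant schedule are assumptions made for illustration; the source only fixes the direction of each schedule and the relation $s_{2,l}=\tfrac{1}{2}s_{1,l}$.

```python
import numpy as np


def scale_schedule(num_layers, s1_low, s1_high, mode="ordered"):
    """Return per-layer arrays (s1, s2) for one of three schedules.

    Assumptions (not from the paper): linear spacing between s1_low and
    s1_high, and the midpoint as the constant value for "normal".
    """
    if mode == "ordered":
        # s1 increases with the layer index l (the proposed direction).
        s1 = np.linspace(s1_low, s1_high, num_layers)
    elif mode == "reversed":
        # s1 decreases with the layer index l.
        s1 = np.linspace(s1_high, s1_low, num_layers)
    elif mode == "normal":
        # Constant s1 across layers, as in the original SWIM setting.
        s1 = np.full(num_layers, 0.5 * (s1_low + s1_high))
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # The text ties the second scale factor to the first: s2 = s1 / 2.
    s2 = 0.5 * s1
    return s1, s2


# Hypothetical end points for a network with four hidden layers.
for mode in ("ordered", "reversed", "normal"):
    s1, s2 = scale_schedule(4, 0.5, 2.0, mode)
    print(mode, s1, s2)
```

Keeping the schedule separate from the sampling step makes it easy to swap in a non-linear spacing later, which is one of the hyperparameter choices the TL;DR mentions as future work.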

Paper Structure

This paper contains 11 sections, 10 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: The parameter initialization in the SWIM algorithm. The parameters in the hidden layers are initialized by using the dataset. Only the parameters in the output layer are learned by the conventional linear regression.
  • Figure 2: (Color online) Outputs of the learned fully connected neural network for each layer. The horizontal axis is the input $x$, and the vertical axis shows the values obtained by applying the final-layer weights $W_{L+1}$ to the output of each hidden layer. The dashed curve corresponds to the original function $f(x)$. Since the learning is effectively finished, the output of the final layer, i.e., layer 4, matches the original function $f(x)$ well.
  • Figure 3: (Color online) Examples of different choices of the parameter $s_{1}$. Even if the same data pair $(\bm{x}_{1}, \bm{x}_{2})$ is used, the degree of nonlinearity utilized varies depending on the parameter. When $s'_1 < s_{1}$, the smaller parameter $s'_1$ utilizes only the portion where the activation function changes abruptly.
  • Figure 4: Framework to include the information on spectral bias. The relevant hyperparameter $s_1$ can vary across hidden layers. The "ordered" schedule is the proposed method, in which $s_{1,l}$ gradually increases with the layer index $l$. By contrast, $s_{1,l}$ gradually decreases in the "reversed" schedule. The "normal" schedule corresponds to the original SWIM algorithm, in which $s_{1,l}$ is constant.
  • Figure 5: The RMSE and standard deviation of the neural network outputs for each method, averaged over 5 runs. The horizontal axis is the number of nodes in each hidden layer, and the vertical axis corresponds to the RMSE. Error bars show the standard deviation, but they are too small to be visible.
  • ...and 3 more figures
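
The Figure 1 and Figure 3 captions can be made concrete with a minimal sketch of a sampling-based initialization for one tanh hidden layer. It assumes the pair-based construction $w = s_1 (x_2 - x_1)/\lVert x_2 - x_1\rVert^2$, $b = -\langle w, x_1\rangle - s_2$ with $s_2 = \tfrac{1}{2}s_1$, draws data pairs uniformly at random (a simplification; the original algorithm chooses pairs in a data-dependent way), and fits only the output layer by least squares. All function names, the target function, and the value $s_1 = 1$ are hypothetical choices for illustration.

```python
import numpy as np


def pair_to_neuron(x1, x2, s1):
    """Weight and bias of one neuron built from a data pair (x1, x2).

    With s2 = s1 / 2 the pre-activation equals -s2 at x1 and +s2 at x2,
    so s1 controls how much of the tanh nonlinearity the pair spans.
    """
    diff = x2 - x1
    w = s1 * diff / np.dot(diff, diff)
    b = -np.dot(w, x1) - 0.5 * s1
    return w, b


def sample_hidden_layer(X, width, s1, rng):
    """Build (W, b) for one hidden layer from uniformly sampled pairs.

    Uniform sampling is an assumption of this sketch. Inputs are assumed
    to contain no duplicate points, so x2 - x1 is never zero.
    """
    n = X.shape[0]
    i = rng.integers(0, n, size=width)
    # Offset in [1, n-1] guarantees the second index differs from the first.
    j = (i + rng.integers(1, n, size=width)) % n
    neurons = [pair_to_neuron(X[a], X[c], s1) for a, c in zip(i, j)]
    W = np.stack([w for w, _ in neurons])   # shape (width, d)
    b = np.array([bias for _, bias in neurons])  # shape (width,)
    return W, b


# Hypothetical 1D regression target; not the function used in the paper.
rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 200)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])

W, b = sample_hidden_layer(X, width=50, s1=1.0, rng=rng)
H = np.tanh(X @ W.T + b)                     # hidden features, (200, 50)

# Only the output layer is fitted, here by ordinary least squares
# with an extra column of ones acting as the output bias.
A = np.hstack([H, np.ones((H.shape[0], 1))])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
rmse = np.sqrt(np.mean((A @ coef - y) ** 2))
print("training RMSE:", rmse)
```

No gradient-based training appears anywhere in the sketch: the hidden parameters come directly from the data, and the output weights come from a single linear solve. A deeper network would repeat `sample_hidden_layer` on the features of the previous layer, with a different `s1` per layer.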