Understanding training and generalization in deep learning by Fourier analysis

Zhiqin John Xu

TL;DR

This work uses Fourier analysis to study DNN training and to explain why Deep Neural Networks (DNNs) often achieve remarkably low generalization error. It suggests that small initialization leads to good generalization while preserving the DNN's ability to fit any function.

Abstract

Background: It remains an open problem to explain theoretically why Deep Neural Networks (DNNs), which have many more parameters than training data and are trained by (stochastic) gradient-based methods, often achieve remarkably low generalization error. Contribution: We study DNN training by Fourier analysis. Our theoretical framework explains two observations: i) DNNs trained with (stochastic) gradient-based methods often give low-frequency components of the target function a higher priority during training; ii) small initialization leads to good generalization ability of DNNs while preserving their ability to fit any function. These results are further confirmed by experiments in which DNNs fit natural images, one-dimensional functions, and the MNIST dataset.
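
To make "priority of low-frequency components" concrete, the sketch below compares the discrete Fourier transform (DFT) of a target with that of a model output at chosen frequency indices. It assumes the relative error $|F[f](k)-F[\Upsilon](k)|/|F[f](k)|$, which is one plausible reading of the $\Delta_{F}$ quantity named in the figure captions; the function names, the toy signal, and the stand-in "model output" are all hypothetical and not taken from the paper.

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) discrete Fourier transform of a real or complex sequence."""
    n = len(samples)
    return [
        sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples))
        for k in range(n)
    ]


def relative_frequency_error(target, output, indices):
    """Relative DFT error |F[f](k) - F[y](k)| / |F[f](k)| at each index k.

    Assumed definition; each index in `indices` should be a peak of the
    target spectrum so that the denominator is not close to zero.
    """
    f_hat = dft(target)
    y_hat = dft(output)
    return {k: abs(f_hat[k] - y_hat[k]) / abs(f_hat[k]) for k in indices}


# Toy data (hypothetical, not from the paper): one period sampled at 64 points.
N = 64
xs = [2 * math.pi * t / N for t in range(N)]
target = [math.sin(x) + 0.5 * math.sin(4 * x) for x in xs]

# Stand-in for a partially trained model: the low frequency is matched,
# but the high-frequency amplitude is only 0.1 instead of 0.5.
output = [math.sin(x) + 0.1 * math.sin(4 * x) for x in xs]

errors = relative_frequency_error(target, output, indices=[1, 4])
print(errors)  # error close to 0 at index 1 and close to 0.8 at index 4
```

Tracking such errors at several spectral peaks over training epochs would show whether the low-frequency peaks converge first; here the output is fixed by hand, so the numbers only illustrate the measurement, not the training behaviour.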

Paper Structure

This paper contains 21 sections, 6 theorems, 67 equations, 6 figures.

Key Result

Theorem 1

Consider a DNN with one hidden layer using the tanh function $\sigma(x)$ as the activation function. For any frequencies $k_{1}$ and $k_{2}$ with $k_{2}>k_{1}>0$ for which there exist $c_{1},c_{2}$ such that $A(k_{1})>c_{1}>0$ and $A(k_{2})<c_{2}<\infty$, we have [equation not shown], where $B_{\delta}$ is a ball with radius $\delta$ centered at the origin and $\mu(\cdot)$ is the Lebesgue measure of a set.

Figures (6)

  • Figure 1: Magnitude of DNN parameters while fitting the MNIST dataset. DNN parameters are initialized from a Gaussian distribution with mean $0$ and standard deviation 0.06, 0.2, 0.6 for (a, b, c), respectively. Solid lines show the mean magnitude of the absolute weights (red) and the absolute bias (green) at each training epoch. The dashed lines are the mean$\pm$std for the corresponding color. Note that the green and the red lines almost overlap. We use a tanh DNN with width: 800-400-200-100. The learning rate is $10^{-5}$ with batch size $400$.
  • Figure 2: Convergence from low to high frequency for a natural image. The training data are all pixels whose horizontal indexes are odd. (a) True image. (b) $|F[f]|$ of the red dashed pixels in (a) as a function of frequency index, with selected peaks marked by black dots (for the DFT, a frequency component can be referred to by its frequency index instead of its physical frequency). (c) $\Delta_{F}$ at different training epochs for different selected frequency peaks in (b). (d) $|F[f]|$ (red) and $|F[\Upsilon]|$ (green) at epoch 1369. (e-g) DNN outputs of all pixels at different training steps. (h) Loss functions. We use a DNN with width 500-400-300-200-200-100-100. We train the DNN with the full batch and learning rate $2\times10^{-5}$. We initialize DNN parameters from a Gaussian distribution with mean $0$ and standard deviation $0.08$.
  • Figure 3: Analysis of the training process of DNNs with large initialization while fitting the image in Fig. 2a. The weights of DNNs are initialized from a Gaussian distribution with mean $0$ and standard deviation $0.5$. (a) The DNN outputs at the training pixels (left) and all pixels (right). (b) Loss functions. (c) DNN outputs of training data at the red dashed position in (a). (d) DNN outputs including test data at the red dashed position in (a). (e) $\Delta_{F}$ at different training epochs for different selected frequency peaks in Fig. 2b.
  • Figure 4: Analysis of the training process of DNNs with different initializations while fitting the MNIST dataset. The panels show the prediction accuracy on the training data and the test data at different training epochs. We use a tanh DNN with width: 800-400-200-100. The learning rate is $10^{-5}$ with batch size $400$. DNN parameters are initialized from a Gaussian distribution with mean $0$. The legend $(\cdot,\cdot)$ denotes the standard deviations of weights and bias terms, respectively.
  • Figure 5: Convergence from low frequency to high frequency for a 1-d function while the spectral norm remains almost unchanged. (a) The target function. (b) $|F[f]|$ (red solid line) as a function of frequency index with important peaks marked by black dots. (c) Spectral norm of all weights. Following Rahaman et al. (2018), for matrix-valued weights, the spectral norm was computed by evaluating the eigenvalue of the eigenvector. For vector-valued weights, we simply use the $L_{2}$ norm. (d) $\Delta_{F}$ at different recording steps for different selected frequency peaks in (b). The training data are evenly sampled in $[-10,10]$ with sample size 120. We use a DNN with width: 800-800-800-800-800-800-500-100. We train the DNN with the full batch and learning rate $2\times10^{-6}$. We initialize DNN parameters from a Gaussian distribution with mean $0$ and standard deviation $0.1$.
  • ...and 1 more figure
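
The Figure 5 caption mentions a spectral norm for matrix-valued weights and an $L_{2}$ norm for vector-valued weights, but does not say how the eigenvalue is obtained. The sketch below assumes power iteration on $W^{T}W$, a common way to estimate the largest singular value; the function names, iteration count, and random start vector are illustrative choices, not the authors' implementation.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm, used directly for vector-valued weights (e.g. biases)."""
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(w, iters=100, seed=0):
    """Estimate the largest singular value of matrix w (a list of rows).

    Power iteration on W^T W (an assumed method): repeatedly apply
    v <- W^T (W v) and renormalise, then return ||W v|| for the final v.
    """
    rows, cols = len(w), len(w[0])
    rng = random.Random(seed)
    # Random start vector, so it is almost surely not orthogonal
    # to the dominant right singular vector.
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    for _ in range(iters):
        u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # W v
        z = [sum(w[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # W^T u
        norm = l2_norm(z)
        if norm == 0.0:
            return 0.0  # happens for the zero matrix
        v = [x / norm for x in z]
    u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return l2_norm(u)


weights = [[3.0, 0.0], [0.0, 1.0]]   # toy weight matrix, not from the paper
bias = [3.0, 4.0]                    # toy bias vector
print(spectral_norm(weights), l2_norm(bias))
```

Convergence slows when the two largest singular values are close, so the fixed iteration count here is only adequate for small, well-separated examples.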

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem
  • proof
  • Theorem
  • proof
  • Theorem
  • proof