A rationale from frequency perspective for grokking in training neural network

Zhangchen Zhou; Yaoyu Zhang; Zhi-Qin John Xu

TL;DR

This paper empirically provides a frequency perspective to explain the emergence of grokking in NNs, observing that the networks initially learn the less salient frequency components present in the test data.

Abstract

Grokking is the phenomenon where neural networks (NNs) initially fit the training data and only later, as training continues, generalize to the test data. In this paper, we empirically provide a frequency perspective to explain the emergence of this phenomenon in NNs. The core insight is that the networks initially learn the less salient frequency components present in the test data. We observe this phenomenon across both synthetic and real datasets, offering a novel viewpoint that characterizes grokking through the frequency dynamics of the training process. Our empirical frequency-based analysis sheds new light on the grokking phenomenon and its underlying mechanisms.

Paper Structure (24 sections, 4 equations, 17 figures)

Introduction
Related Works
Grokking
Frequency Principle
Preliminaries
Notations
Nonuniform discrete Fourier transform
One-dimensional synthetic data
Dataset
Experiment Settings
The frequency spectrum of the synthetic data
Parity Function
Parity Function
Experiment Settings
The frequency spectrum of parity function
...and 9 more sections

Figures (17)

Figure 1: (a) and (b) show the train and test loss for the $n=65$ and $n=1000$ nonuniform experiments, respectively. (c) shows the train and test loss for the $n=65$ uniform experiment. The activation function is $\sin x$. Each experiment is averaged over $10$ trials and the lightly shaded regions represent the standard deviation.
Figure 2: During training with a set of $n=65$ non-uniformly sampled data points, the learned output function evolves across epochs as shown in (a)-(c) for $0$, $2000$, and $35000$ epochs, respectively. The blue stars represent the exact training data points, the green dots are the network's outputs on the training data, and the red curve shows the overall learned output function, evaluated on $1000$ evenly spaced data points.
Figure 3: The evolution of the frequency spectrum during training for Fig. \ref{fig:train_test_output_65}. The columns from left to right correspond to epochs $0$, $2000$, and $35000$, respectively. The top row illustrates the frequency spectra of the target function (orange solid lines) and the network's output (blue solid lines) on the training data. The bottom row shows the frequency spectra of the target function (orange solid lines) and the network's output (blue solid lines) on the test data. The abscissa represents frequency and the ordinate represents the amplitude of the corresponding frequency components.
Figure 4: (a) Loss for the $(10,[10])$ parity function with different training data proportions. The blue solid lines correspond to the training dataset, and the orange solid lines correspond to the test dataset. (b) The frequency spectra of the training set for different proportion ratios. The blue solid lines are the frequency spectra of the training dataset, and the orange solid line is the frequency spectrum of all data. The proportion ratios of the training dataset are $0.2$, $0.5$, and $0.8$, from light to dark. The abscissa represents frequency and the ordinate represents the amplitude of the corresponding frequency components. Each experiment is averaged over $10$ trials and the lightly shaded regions represent the standard deviation.
Figure 5: (a) The train and test loss for a specific experiment with proportion ratio $0.5$. The abscissa represents epochs and the ordinate represents the loss. The grey background marks the intervals selected for (c) and (d). (b) The frequency spectrum difference between the training dataset and the whole dataset. (c)(d) The frequency spectrum of the whole dataset. The orange solid line is the exact frequency spectrum of the $(10,[10])$ parity function. The blue solid lines are the frequency spectra during training. Lines run from light to dark as epochs increase, with (c) recording every $20$ epochs and (d) recording every $100$ epochs. The abscissa represents frequency and the ordinate represents the amplitude of the corresponding frequency components.
...and 12 more figures
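
The captions above compare frequency spectra computed on non-uniformly sampled training points with spectra computed on test points. As a rough illustration of how such a spectrum might be obtained, the following is a minimal sketch of a nonuniform discrete Fourier transform. The $1/n$ normalization, the function name `nudft_amplitude`, the target function, the sampling interval, and the frequency range are all assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np


def nudft_amplitude(x, y, freqs):
    """Amplitude of a nonuniform DFT of samples (x_j, y_j) at the given frequencies.

    Assumed convention (not from the paper):
        F(k) = (1/n) * sum_j y_j * exp(-2*pi*i*k*x_j)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    # Rows index frequencies, columns index sample points.
    phase = np.exp(-2j * np.pi * np.outer(freqs, x))
    return np.abs(phase @ y) / len(x)


# Hypothetical target with a single frequency component at k = 3.
def target(x):
    return np.sin(2 * np.pi * 3 * x)


rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 1.0, size=65))      # few nonuniform samples
x_test = np.linspace(0.0, 1.0, 1000, endpoint=False)   # dense, evenly spaced

freqs = np.arange(0, 11)
amp_train = nudft_amplitude(x_train, target(x_train), freqs)
amp_test = nudft_amplitude(x_test, target(x_test), freqs)

# On the dense uniform grid the spectrum is a clean peak at k = 3.
# On the sparse nonuniform sample, energy can leak into other frequencies.
print("test spectrum: ", np.round(amp_test, 3))
print("train spectrum:", np.round(amp_train, 3))
```

Comparing `amp_train` with `amp_test` gives a sense of how a small nonuniform sample can show frequency components that are weak or absent in the densely sampled data, which is the kind of mismatch the figures examine.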
