Boundary between noise and information applied to filtering neural network weight matrices

Max Staats; Matthias Thamm; Bernd Rosenow

TL;DR

An algorithm for noise filtering is introduced, which both removes small singular values and reduces the magnitude of large singular values to counteract the effect of level repulsion between the noise and the information part of the spectrum.

Abstract

Deep neural networks have been successfully applied to a broad range of problems where overparametrization yields weight matrices which are partially random. A comparison of weight matrix singular vectors to the Porter-Thomas distribution suggests that there is a boundary between randomness and learned information in the singular value spectrum. Inspired by this finding, we introduce an algorithm for noise filtering, which both removes small singular values and reduces the magnitude of large singular values to counteract the effect of level repulsion between the noise and the information part of the spectrum. For networks trained in the presence of label noise, we indeed find that the generalization performance improves significantly due to noise filtering.
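The removal step of the filtering idea can be illustrated with a minimal sketch. It only shows how to set a chosen fraction of the smallest singular values of a weight matrix to zero. The additional reduction of large singular values mentioned above is not reproduced, because its formula is not given in this section. The function name `zero_smallest_singular_values`, the `fraction` parameter, and the synthetic low-rank-plus-noise matrix are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def zero_smallest_singular_values(weights, fraction):
    """Return a copy of `weights` with the smallest singular values set to zero.

    `fraction` is the share of singular values to remove (hypothetical parameter).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    # Thin SVD: weights = u @ diag(s) @ vt, with s sorted in descending order.
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    n_remove = int(round(fraction * s.size))
    s_filtered = s.copy()
    if n_remove > 0:
        # The smallest singular values sit at the end of the sorted array.
        s_filtered[s.size - n_remove:] = 0.0
    # Scale the columns of u by the filtered singular values and recombine.
    return (u * s_filtered) @ vt


rng = np.random.default_rng(0)
# Illustrative stand-in for a trained layer: rank-3 "information" plus noise.
signal = rng.normal(size=(64, 3)) @ rng.normal(size=(3, 32))
weights = signal + 0.1 * rng.normal(size=(64, 32))
filtered = zero_smallest_singular_values(weights, fraction=0.75)
```

In an experiment of the kind described in the abstract, the filtered matrix would replace the layer weights before re-evaluating the accuracy. The fraction to remove is a free choice in this sketch.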

Paper Structure (3 equations, 5 figures)

Figures (5)

Figure 1: Analysis of singular values $\nu$ and vectors $V$ of the first hidden layer weight matrix for the MLP1024 network trained with various amounts of label noise: 0% (blue), 40% (green), and 100% (brown). For reference, we show results for randomly initialized weights in red. The upper panel shows the randomness of singular vectors via the p-value of Kolmogorov-Smirnov tests against a Porter-Thomas distribution, averaged over neighboring singular values with a window size of 15; the light red stripe marks the $2\sigma$ region around the mean for random vectors. The lower panel depicts the corresponding singular value spectra obtained via Gaussian broadening with a window size of 15 (solid lines). The dashed line shows the fit of a Marchenko-Pastur distribution to the spectrum for 0% label noise.
Figure 2: Information-noise boundary, demonstrated by setting a given percentage of the singular values to zero. (a) Training accuracy for the MLP1024 network trained with various amounts of label noise (0% blue, 40% green, and 100% brown). (b) Training accuracy when singular values of the second convolutional layer of miniAlexNet, trained with 0% label noise (blue) and 100% label noise (brown), are set to zero. (c) Test accuracy for miniAlexNet trained without label noise when singular values are set to zero in the first dense layer (orange) and the second convolutional layer (blue). (d) Test accuracy for the pre-trained networks vgg19 [Simonyan.2014] (third dense layer, blue) and alexnet [Krizhevsky.2017] (second dense layer, orange). In all cases, relevant information is stored only in the largest singular values and the corresponding vectors. In the presence of label noise, larger parts of the spectrum are needed to store the noise.
Figure 3: Dependence of the test accuracy on the removal and shifting of singular values from the second hidden layer weights of MLP1024 networks trained in the presence of label noise. Both upon setting singular values to zero (blue) and when additionally shifting them according to Eq. (3) (green), we observe a significant improvement in performance. For training with overfitting (red), no improvement is observed, indicating that information and noise are mixed in the spectrum.
Figure 4: Shifting of singular values: histogram of singular values for the first hidden layer weight matrix of the MLP1024 network (blue) trained with 40% label noise, with boundaries of the Marchenko-Pastur region (dashed black lines). The dashed red lines show the locations of shifted singular values according to Eq. (3), and the inset zooms into the tail region.
Figure 5: Average improvement of the test accuracy when removing singular values (blue, red) from all layers and when additionally shifting singular values (green) of the first two layers in MLP1024 networks, with results for both the learning rate schedule (blue crosses, green diamonds) and an overfitting schedule (red squares). We observe that the average improvements increase with increasing amount of label noise, with an enhanced improvement for additionally shifting singular values. There are no improvements for networks trained with overfitting.
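
The singular-vector randomness test described in the caption of Figure 1 can be sketched as follows. The sketch assumes that the Porter-Thomas distribution here means Gaussian-distributed vector components with zero mean and variance $1/n$ for a unit vector of dimension $n$, so that the rescaled components can be tested against a standard normal distribution. The helper names `porter_thomas_pvalue` and `windowed_pvalues` are hypothetical. The random matrix stands in for a weight matrix, and the window size of 15 follows the caption.

```python
import numpy as np
from scipy import stats


def porter_thomas_pvalue(vector):
    """KS-test p-value for the hypothesis that a vector's components look random."""
    v = np.asarray(vector, dtype=float)
    v = v / np.linalg.norm(v)
    # Assumption: components of a random unit vector are ~ N(0, 1/n), so
    # sqrt(n) * v_i is compared with a standard normal distribution.
    scaled = np.sqrt(v.size) * v
    return stats.kstest(scaled, "norm").pvalue


def windowed_pvalues(weights, window=15):
    """Average the p-values over `window` neighboring singular vectors."""
    # Rows of vt are right singular vectors, ordered by descending singular value.
    _, _, vt = np.linalg.svd(weights, full_matrices=False)
    pvalues = np.array([porter_thomas_pvalue(row) for row in vt])
    kernel = np.ones(window) / window
    # "valid" keeps only windows that lie fully inside the spectrum.
    return np.convolve(pvalues, kernel, mode="valid")


rng = np.random.default_rng(0)
# Illustrative stand-in for a randomly initialized weight matrix.
random_weights = rng.normal(size=(256, 128))
averaged = windowed_pvalues(random_weights)

# A strongly localized vector should be rejected as non-random.
localized = np.zeros(128)
localized[0] = 1.0
p_localized = porter_thomas_pvalue(localized)
```

For a trained network, low averaged p-values at the large-singular-value end of the spectrum would indicate vectors that deviate from randomness, which is the kind of boundary the figure illustrates.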