Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning

Charles H. Martin, Michael W. Mahoney

TL;DR

The paper introduces a framework based on Random Matrix Theory (RMT) to explain Implicit Self-Regularization in deep neural networks. It argues that training induces multi-scale correlations in the weight matrices, so that they come to resemble Heavy-Tailed random matrices rather than purely random Gaussian ones. By analyzing the empirical spectral densities (ESDs) of layer weight matrices across small and large networks, the authors identify a 5+1 phase taxonomy of training (Random-like, Bleeding-out, Bulk+Spikes, Bulk-decay, Heavy-Tailed, Rank-collapse) described by Marchenko-Pastur (MP) theory and its Heavy-Tailed extensions, with the MP Soft Rank serving as a practical diagnostic. They demonstrate that smaller batch sizes push models into more strongly regularized phases, which gives a plausible mechanism for the generalization gap, and they show how the effects of explicit regularization appear in the ESDs and in eigenvector localization. The work offers a practical, predictive theory linking training dynamics to spectral properties, with implications for model design, generalization, and the interpretation of deep learning through the lens of disordered systems and self-organization.

Abstract

Random Matrix Theory (RMT) is applied to analyze weight matrices of Deep Neural Networks (DNNs), including both production-quality, pre-trained models such as AlexNet and Inception, and smaller models trained from scratch, such as LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of Self-Regularization. The empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even when no traditional form of explicit regularization is exogenously specified. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of Implicit Self-Regularization. These phases can be observed during the training process as well as in the final learned DNNs. For smaller and/or older DNNs, this Implicit Self-Regularization is like traditional Tikhonov regularization, in that there is a "size scale" separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of Heavy-Tailed Self-Regularization, similar to the self-organization seen in the statistical physics of disordered systems. This results from correlations arising at all size scales, which arise implicitly due to the training process itself. This Implicit Self-Regularization can depend strongly on the many knobs of the training process. By exploiting the generalization gap phenomenon, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that, all else being equal, DNN optimization with larger batch sizes leads to less well implicitly regularized models, and it provides an explanation for the generalization gap phenomenon.
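
As a minimal sketch of the kind of spectral analysis the abstract describes (not the authors' code), the Python below computes the ESD of a layer-like matrix and compares it with the MP bulk edges. It assumes the convention that $\mathbf{W}$ is $N \times M$ with $N \ge M$, that the ESD is the eigenvalue set of $\mathbf{X} = \mathbf{W}^{T}\mathbf{W}/N$, that $Q = N/M$, and that the bulk edges are $\lambda^{\pm} = \sigma^{2}(1 \pm 1/\sqrt{Q})^{2}$; the MP Soft Rank is taken to be $\lambda^{+}/\lambda_{max}$. The function names, matrix sizes, and the planted rank-one "spike" are illustrative choices, and $\sigma$ is assumed known rather than fitted.

```python
import numpy as np


def esd(W):
    """Eigenvalues of X = W^T W / N for an N x M matrix W (assumes N >= M)."""
    N, _ = W.shape
    singular_values = np.linalg.svd(W, compute_uv=False)
    return singular_values**2 / N


def mp_edges(Q, sigma=1.0):
    """Lower and upper edges of the Marchenko-Pastur bulk for aspect ratio Q >= 1."""
    lam_minus = sigma**2 * (1.0 - 1.0 / np.sqrt(Q)) ** 2
    lam_plus = sigma**2 * (1.0 + 1.0 / np.sqrt(Q)) ** 2
    return lam_minus, lam_plus


def mp_soft_rank(W, sigma=1.0):
    """Ratio of the MP bulk edge to the largest ESD eigenvalue (sigma assumed known)."""
    N, M = W.shape
    _, lam_plus = mp_edges(N / M, sigma)
    return float(lam_plus / esd(W).max())


rng = np.random.default_rng(0)
N, M = 1000, 500  # illustrative sizes, Q = 2

# Random-like case: i.i.d. Gaussian entries with unit variance.
W = rng.normal(size=(N, M))
lam = esd(W)
lam_minus, lam_plus = mp_edges(N / M)

# Spiked case: add a strong rank-one component on top of the noise.
u = rng.normal(size=N)
u /= np.linalg.norm(u)
v = rng.normal(size=M)
v /= np.linalg.norm(v)
theta = 3.0  # hypothetical spike strength
W_spiked = W + theta * np.sqrt(N) * np.outer(u, v)

print("MP bulk edges:", lam_minus, lam_plus)
print("ESD range (random):", lam.min(), lam.max())
print("MP Soft Rank (random):", mp_soft_rank(W))
print("MP Soft Rank (spiked):", mp_soft_rank(W_spiked))
```

For the purely Gaussian matrix the largest eigenvalue sits near $\lambda^{+}$, so the soft rank is close to 1; the planted spike pushes $\lambda_{max}$ well above the bulk edge, and the soft rank drops accordingly.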

Paper Structure

This paper contains 86 sections, 27 equations, 31 figures, 5 tables.

Figures (31)

  • Figure 1: The behavior of two complexity measures, the Matrix Entropy $\mathcal{S}(\mathbf{W})$ and the Stable Rank $\mathcal{R}_{s}(\mathbf{W})$, for Layers FC1 and FC2, during Backprop training, for MLP3. Both measures display a transition during Backprop training.
  • Figure 2: Scree plots for initial and final configurations for Layers FC1 and FC2, during Backprop training, for MLP3.
  • Figure 3: Histograms of the Singular Values $\nu_{i}$ and associated Eigenvalues $\lambda_{i}=\nu^{2}_{i}$, comparing initial $\mathbf{W}^{0}_{l}$ and final $\mathbf{W}_{l}$ weight matrices (which are $N \times M$, with $N=M$) for Layer FC2 of an MLP3 trained on CIFAR10.
  • Figure 4: Marchenko-Pastur (MP) distributions, see Eqns. (\ref{eqn:mp_distribution}) and (\ref{eqn:lambda_pm}), as the aspect ratio $Q$ and variance parameter $\sigma$ are modified.
  • Figure 5: The log-log histogram plots of the ESD for three Heavy-Tailed random matrices $\mathbf{M}$ with the same aspect ratio $Q=3$, with $\mu=1.0, 3.0, 5.0$, corresponding to the three Heavy-Tailed Universality classes ($0 < \mu < 2$ vs $2 < \mu < 4$ and $4 < \mu$) described in Table \ref{table:mp_vanilla_spiked_ht}.
  • ...and 26 more figures
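
The two complexity measures named in the Figure 1 caption can be sketched in a few lines of Python. This is an illustration under assumed definitions, not the authors' implementation: the Stable Rank is taken as $\mathcal{R}_{s}(\mathbf{W}) = \sum_{i}\nu_{i}^{2}/\nu_{max}^{2}$, and the Matrix Entropy as $\mathcal{S}(\mathbf{W}) = -\sum_{i} p_{i}\log p_{i} / \log R$, with $p_{i} = \nu_{i}^{2}/\sum_{j}\nu_{j}^{2}$ and $R$ the numerical rank. The function names and the tolerance `tol` used to decide the numerical rank are hypothetical.

```python
import numpy as np


def stable_rank(W):
    """Squared Frobenius norm divided by squared spectral norm."""
    sv = np.linalg.svd(W, compute_uv=False)  # sorted in descending order
    return float(np.sum(sv**2) / sv[0] ** 2)


def matrix_entropy(W, tol=1e-10):
    """Entropy of the normalized squared singular values, scaled to [0, 1]."""
    sv = np.linalg.svd(W, compute_uv=False)
    sv = sv[sv > tol * sv[0]]  # keep only numerically nonzero singular values
    rank = len(sv)
    if rank < 2:
        return 0.0  # a rank-one matrix has no spread to measure
    p = sv**2 / np.sum(sv**2)
    return float(-np.sum(p * np.log(p)) / np.log(rank))


rng = np.random.default_rng(0)
W_random = rng.normal(size=(300, 200))       # unstructured, random-like matrix
W_rank_one = np.outer(np.ones(3), np.ones(4))  # fully correlated matrix

print("random:   stable rank", stable_rank(W_random),
      "entropy", matrix_entropy(W_random))
print("rank one: stable rank", stable_rank(W_rank_one),
      "entropy", matrix_entropy(W_rank_one))
```

Under these definitions both measures are largest when the spectrum is flat and shrink as a few singular values come to dominate, which is the sense in which they act as complexity measures.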