Pruning-induced phases in fully-connected neural networks: the eumentia, the dementia, and the amentia

Haining Pan; Nakul Aggarwal; J. H. Pixley

Abstract

Modern neural networks are heavily overparameterized, and pruning, which removes redundant neurons or connections, has emerged as a key approach to compressing them without sacrificing performance. However, while practical pruning methods are well developed, whether pruning induces sharp phase transitions in neural networks and, if so, to what universality class they belong remain open questions. To address this, we study fully-connected neural networks trained on MNIST, independently varying the dropout rate (i.e., the fraction of neurons removed) at the training and evaluation stages to map the phase diagram. We identify three distinct phases: eumentia (the network learns), dementia (the network has forgotten), and amentia (the network cannot learn), sharply distinguished by the power-law scaling of the cross-entropy loss with the training dataset size. In the eumentia phase, the algebraic decay of the loss, documented in the machine learning literature as neural scaling laws, is, from the perspective of statistical mechanics, the hallmark of quasi-long-range order. We demonstrate that the transition between the eumentia and dementia phases is accompanied by scale invariance, with a diverging length scale that exhibits hallmarks of a Berezinskii-Kosterlitz-Thouless-like transition; the phase structure is robust across different network widths and depths. Our results establish that dropout-induced pruning provides a concrete setting in which neural network behavior can be understood through the lens of statistical mechanics.
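
The phases above are distinguished by how the cross-entropy loss scales with the training dataset size $N$. As a minimal sketch of how such an exponent could be extracted, the snippet below fits $\mathcal{L} \approx A N^{-\alpha}$ by linear least squares in log-log space. The function name, the synthetic data, and the sign convention for $\alpha$ are assumptions made for illustration; the paper's own definition of $\alpha_{\text{CE}}$ is not reproduced in this excerpt and may differ.

```python
import numpy as np

def fit_power_law_exponent(n_train, loss):
    """Fit loss ~ A * N**(-alpha) in log-log space and return (alpha, A).

    Hypothetical helper: the sign convention (alpha > 0 means the loss
    decays with N) is an assumption, not taken from the paper.
    """
    log_n = np.log(np.asarray(n_train, dtype=float))
    log_loss = np.log(np.asarray(loss, dtype=float))
    # A straight-line fit in log-log space: log(loss) = intercept + slope * log(N)
    slope, intercept = np.polyfit(log_n, log_loss, 1)
    return float(-slope), float(np.exp(intercept))

# Synthetic, noise-free data (not results from the paper): loss = 3 * N**(-0.5)
n_train = np.array([1000.0, 2000.0, 5000.0, 10000.0, 20000.0, 60000.0])
loss = 3.0 * n_train ** -0.5

alpha, amplitude = fit_power_law_exponent(n_train, loss)
print(f"alpha = {alpha:.3f}, amplitude = {amplitude:.3f}")
```

With this convention, a positive fitted exponent means the loss decays algebraically as more training data is added, a value near zero means the loss is insensitive to $N$, and a negative value means the loss grows with $N$.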

Paper Structure (15 sections, 20 equations, 15 figures)

Introduction
Model and Methods
Fully-connected neural networks
Training and evaluation dropout rate
Results
Phase diagram
Exponents of the power-law scaling of the cross-entropy loss
BKT-like phase transition from eumentia to dementia phase
Accuracy
Generality of the phase diagram for different width and depth
Conclusions and outlook
Finite-size scaling using BKT-like transition ansatz
Finite-size scaling using conventional algebraic divergence
MNIST Dataset
Neuron Dropout

Figures (15)

Figure 1: (a) Schematic of the fully-connected neural network. A $28\times 28$ grayscale image from the MNIST dataset is first flattened into a $784$-dimensional input vector and then passed through a rectangular architecture with $L$ hidden layers, each containing $n_h$ neurons. During training, the weights and biases are optimized on the training set by minimizing the cross-entropy loss using the Adam optimizer. During evaluation, these optimized parameters are frozen, and the network is evaluated on the test dataset to compute performance metrics, including the classification accuracy $\langle\mathcal{A}\rangle$ and the cross-entropy loss $\langle\mathcal{L}_{\text{CE}}\rangle$. The average values are obtained over $20$ independent training runs with different initializations and mini-batch samplings. (See Sec. \ref{sec:model} for details.) (b) Schematic of the neuron-dropout model. The network architecture is identical to that in panel (a), except that neurons in the hidden layers are randomly masked by setting their post-activation outputs to zero. The neurons are dropped independently with dropout rates $p_{\text{train}}$ and $p_{\text{eval}}$ during the training and evaluation stages, respectively. For each fixed $p_{\text{eval}}$, we sample $100$ independent dropout masks and estimate the mean accuracy.
Figure 2: (a) Phase diagram as a function of the training dropout rate $p_{\text{train}}$ and the evaluation dropout rate $p_{\text{eval}}$. The three phases (eumentia, dementia, and amentia) are characterized by the exponent of the power-law decay of the cross-entropy loss, as shown in panel (c). (b) Averaged accuracy $\expval{\mathcal{A}}$ for training dataset size $N=60000$. (c) Exponent of the cross-entropy loss $\alpha_{\text{CE}}$, defined in Eq. \ref{eq:CE}, for training dataset size $N=60000$; its sign indicates which of the three phases the network is in.
Figure 3: (a) Linecuts of the cross-entropy loss at a fixed evaluation dropout rate $p_{\text{eval}}=0.7$ as a function of the training dropout rate $p_{\text{train}}$ for different training dataset sizes $N$. The vertical dashed lines indicate the phase boundaries. (b-d) Power-law scaling of the averaged cross-entropy loss $\expval{\mathcal{L}_{\text{CE}}}$ as a function of the training dataset size $N$ at different training dropout rates $p_{\text{train}}$ in the dementia, eumentia, and amentia phases, respectively. The solid lines are power-law fits to the data on a log-log scale following Eq. \ref{eq:CE}.
Figure 4: (a) Averaged cross-entropy loss as a function of the evaluation dropout rate $p_{\text{eval}}$ at $p_{\text{train}}=0.1$ for different training dataset sizes $N$. Inset: data collapse of the averaged cross-entropy loss $\expval{\mathcal{L}_{\text{CE}}}$ with the fitted values $p_{\text{eval}, c}=0.60(2)$ and $\sigma=0.8(1)$. (b) Goodness of the data collapse $\chi_\nu^2$ as a function of the fitting parameters $\sigma$ and $p_{\text{eval}, c}$. The minimum, $\chi_{\nu,\min}^2=26.6$, is marked by the cross, and the contour shows $1.3\chi_{\nu,\min}^2$. (c) Left axis: the fitted exponent $\sigma$ from the BKT-like transition ansatz in Eq. \ref{eq:bkt} as a function of the training dropout rate $p_{\text{train}}$; right axis: the corresponding goodness of the data collapse $\chi_\nu^2$.
Figure 5: (a) Averaged accuracy $\expval{\mathcal{A}}$ at a fixed training dropout rate $p_{\text{train}}=0.1$ as a function of the evaluation dropout rate $p_{\text{eval}}$ for different training dataset sizes $N$. In all panels, the vertical dashed line marks the critical point $p_{\text{eval}, c}$ obtained from Fig. \ref{fig:CE_linecut_BKT}. The horizontal dashed line marks the accuracy expected from random guessing. (b) Binder cumulant (see Eq. \ref{eq:binder} for the definition) of the same data as in panel (a). The horizontal dashed line marks the expected value for a Gaussian variable.
...and 10 more figures
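
The neuron-dropout model described in the caption of Figure 1(b) can be sketched in a few lines of NumPy. This is an illustrative forward pass only, not the authors' implementation: the ReLU activation, the absence of any rescaling of the surviving activations, the use of one mask per network shared across the input batch, and all names (`forward_with_neuron_dropout`, `p_drop`, the small layer sizes) are assumptions.

```python
import numpy as np

def forward_with_neuron_dropout(x, weights, biases, p_drop, rng):
    """Forward pass of a fully-connected network with hidden-neuron dropout.

    Each hidden neuron is dropped independently with probability p_drop by
    setting its post-activation output to zero. Assumptions (not from the
    paper): ReLU activations, no rescaling of kept neurons, and one mask per
    hidden layer shared across the whole batch.
    """
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ w + b, 0.0)           # assumed ReLU activation
        keep = rng.random(h.shape[-1]) >= p_drop  # True for neurons that survive
        h = h * keep                              # zero the dropped neurons
    return h @ weights[-1] + biases[-1]           # output logits, never masked

# Toy network: 784 inputs, L = 2 hidden layers of width n_h = 16, 10 classes.
rng = np.random.default_rng(0)
sizes = [784, 16, 16, 10]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(0.0, 0.1, n) for n in sizes[1:]]
x = rng.random((2, 784))  # two stand-in "images", already flattened

logits_keep_all = forward_with_neuron_dropout(x, weights, biases, 0.0, rng)
logits_drop_all = forward_with_neuron_dropout(x, weights, biases, 1.0, rng)
```

In this sketch, the same function would be called with `p_drop` set to $p_{\text{train}}$ during training and to $p_{\text{eval}}$ during evaluation. With `p_drop = 1.0` every hidden neuron is masked, so the logits reduce to the output bias and carry no information about the input.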
