A Three-regime Model of Network Pruning

Yefan Zhou; Yaoqing Yang; Arin Chang; Michael W. Mahoney

TL;DR

A phenomenological model grounded in the statistical mechanics of learning uses temperature-like and load-like parameters to model the impact of neural network (NN) training hyperparameters on pruning performance. It reveals that the dichotomous effect of high temperature is associated with transitions between distinct types of global structures in the post-pruned model.

Abstract

Recent work has highlighted the complex influence that training hyperparameters, e.g., the number of training epochs, can have on the prunability of machine learning models. Perhaps surprisingly, a systematic approach to predict precisely how adjusting a specific hyperparameter will affect prunability remains elusive. To address this gap, we introduce a phenomenological model grounded in the statistical mechanics of learning. Our approach uses temperature-like and load-like parameters to model the impact of neural network (NN) training hyperparameters on pruning performance. A key empirical result we identify is a sharp transition phenomenon: depending on the value of a load-like parameter in the pruned model, increasing the value of a temperature-like parameter in the pre-pruned model may either enhance or impair subsequent pruning performance. Based on this transition, we build a three-regime model by taxonomizing the global structure of the pruned NN loss landscape. Our model reveals that the dichotomous effect of high temperature is associated with transitions between distinct types of global structures in the post-pruned model. Based on our results, we present three case studies: 1) determining whether to increase or decrease a hyperparameter for improved pruning; 2) selecting the best model to prune from a family of models; and 3) tuning the hyperparameter of the Sharpness-Aware Minimization (SAM) method for better pruning performance.

Paper Structure (30 sections, 2 equations, 17 figures, 2 tables, 2 algorithms)

Introduction
Background and Setup
Load and temperature
Preliminaries
Network pruning
Loss landscape metrics
Validity of the Three-regime Model
Experimental setup
Regimes of loss landscape
Corroborating results
Application of the Three-regime Model
Determining how to adjust temperature parameters
Selecting the best model without grid search
Tuning the temperature of the SAM method
Conclusion
...and 15 more sections

Figures (17)

Figure 1: The three regimes of pruning obtained by varying temperature-like parameters (in the dense pre-pruned model) and load-like parameters (in the sparse post-pruned model). Loss landscape connectivity metrics such as linear mode connectivity (LMC) separate Regime I from Regime II, and metrics of output similarity between models in the well-connected regime then separate Regime II-A from Regime II-B. The regimes are thus Regime I (poorly-connected loss landscapes); Regime II-A (well-connected but relatively dissimilar model outputs); and Regime II-B (well-connected and relatively similar outputs). For a given load goal (density of the pruned model), we focus on the favorable transitions from Regime I to Regime II-A (obtained by increasing the temperature) and from Regime II-A to Regime II-B (obtained by decreasing the temperature), as indicated by the arrows.
Figure 2: Partitioning the 2D model density (load) versus training epoch (temperature) diagram into three regimes. PreResNet-20 models are trained on CIFAR-10. The $y$-axis shows a temperature-like parameter, the number of training epochs before pruning, and the $x$-axis shows a load-like parameter, the density of the pruned model. (a) Final test error of the models after pruning and retraining. (b) Normalized test error, obtained by subtracting the optimal (lowest) test error in each column of (a) from every entry in that column. The black arrows indicate two favorable transitions to lower test error regimes given a fixed model density. (c) LMC forms a sharp boundary that distinguishes Regime I from Regime II. (d) CKA shows a smooth transition that separates Regime II-A from Regime II-B.
Figure 3: Using LMC to determine the right direction to adjust the temperature: models with negative LMC are located in Regime I (annotated by the black box), and their test error can be reduced by increasing temperature. Models with close-to-zero LMC instead benefit from decreasing temperature. Note that fewer training epochs or a smaller batch size corresponds to a higher temperature.
Figure 4: Selecting temperature using the LMC-based method (squares) leads to a smaller test error than selecting temperature using the test error of the unpruned dense model (crosses). The performance of LMC-based selection is close to the best test error found by grid search (dashed lines). (Left) Selecting the best training epoch. (Right) Selecting the best batch size. Models that perform significantly worse than grid search tend to have worse LMC, as shown by the darker markers.
Figure 5: Comparing final pruning performance with dense model training using SGD versus SAM in Regime I and Regime II.
...and 12 more figures
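
The regime logic described in the captions of Figures 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the thresholds `lmc_tol` and `cka_threshold`, and the toy error grid are all assumptions introduced here, and the numbers are not results from the paper. It assumes LMC is reported so that negative values signal poor connectivity and that a higher CKA means more similar outputs, as the captions suggest.

```python
def normalize_columns(test_error):
    """Subtract each column's lowest test error, as described for Figure 2(b).

    Rows are assumed to index the temperature-like parameter (e.g. training
    epochs) and columns the load-like parameter (model density).
    """
    col_min = [min(col) for col in zip(*test_error)]
    return [[err - m for err, m in zip(row, col_min)] for row in test_error]


def classify_regime(lmc, cka, lmc_tol=-0.01, cka_threshold=0.8):
    """Assign a regime label from connectivity (LMC) and similarity (CKA).

    Both thresholds are hypothetical placeholders; suitable values would
    have to be chosen from the measured LMC and CKA diagrams.
    """
    if lmc < lmc_tol:
        return "I"  # poorly connected loss landscape
    # Well connected: split by how similar the model outputs are.
    return "II-B" if cka >= cka_threshold else "II-A"


def suggest_direction(lmc, lmc_tol=-0.01):
    """Rule of thumb from the Figure 3 caption.

    Negative LMC (Regime I): raise the temperature.
    Close-to-zero LMC: lower the temperature.
    """
    return "increase" if lmc < lmc_tol else "decrease"


# Toy grid of test errors (made-up values): 3 temperatures x 2 densities.
toy_errors = [
    [10.0, 8.0],
    [9.0, 8.5],
    [9.5, 7.0],
]
normalized = normalize_columns(toy_errors)
```

On the toy grid, each column of `normalized` has exactly one zero, marking the temperature with the lowest test error at that density. Keeping the regime label and the adjustment direction in separate functions reflects that Figure 3 states its rule in terms of LMC alone, while CKA is only needed to tell Regime II-A from Regime II-B.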
