Effective Layer Pruning Through Similarity Metric Perspective

Ian Pons; Bruno Yamamoto; Anna H. Reali Costa; Artur Jordao

TL;DR

This work introduces an effective layer-pruning strategy that meets all underlying properties pursued by pruning methods and outperforms existing layer-pruning strategies and other state-of-the-art pruning techniques.

Abstract

Deep neural networks have been the predominant paradigm in machine learning for solving cognitive tasks. Such models, however, are restricted by a high computational overhead, limiting their applicability and hindering advancements in the field. Extensive research has demonstrated that pruning structures from these models is a straightforward approach to reducing network complexity. In this direction, most efforts focus on removing weights or filters. Studies have also been devoted to layer pruning, as it promotes superior computational gains. However, layer pruning often hurts the network's predictive ability (i.e., accuracy) at high compression rates. This work introduces an effective layer-pruning strategy that meets all underlying properties pursued by pruning methods. Our method estimates the relative importance of a layer using the Centered Kernel Alignment (CKA) metric, which measures the similarity between the representations of the unpruned model and those of a candidate layer for pruning. We confirm the effectiveness of our method on standard architectures and benchmarks, on which it outperforms existing layer-pruning strategies and other state-of-the-art pruning techniques. In particular, we remove more than 75% of computation while improving predictive ability. At higher compression regimes, our method exhibits a negligible accuracy drop, while other methods notably deteriorate model accuracy. Apart from these benefits, our pruned models exhibit robustness to adversarial and out-of-distribution samples.
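To make the CKA-based criterion more concrete, the following is a minimal sketch of the linear variant of CKA in plain Python. It is an illustration only, not the authors' implementation: the function names, the toy feature matrices, and the selection rule (treating the candidate whose removal leaves the representation most similar to the unpruned one as the least important) are assumptions made for this example.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(p * q for p, q in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return cross_sq_norm(xc, yc) / denom


def most_similar_candidate(reference, candidates):
    """Hypothetical selection rule: return the candidate with the highest CKA."""
    scores = {name: linear_cka(reference, feats) for name, feats in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy representations (3 samples x 2 features); the values are made up.
reference = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]  # unpruned model
candidates = {
    # A scaled copy of the reference: CKA is invariant to isotropic scaling.
    "without_layer_a": [[2.0, 4.0], [6.0, 8.0], [10.0, 14.0]],
    # An unrelated representation.
    "without_layer_b": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
}
victim, scores = most_similar_candidate(reference, candidates)
print(victim, scores)
```

In this toy setting the scaled copy scores 1 up to floating-point rounding, so `without_layer_a` is selected. In practice the feature matrices would come from forward passes of the unpruned model and of each candidate pruned model on the same batch of inputs.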

Paper Structure (10 sections, 2 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Preliminaries and Proposed Method
Experiments
Conclusions
Appendix
Technical Details Involving Layer Pruning
Adversarial Attacks
Results on Shallow Architectures
Results on the Transformer Architecture

Figures (5)

Figure 1: Comparison with the state of the art on the popular ResNet56 + CIFAR-10 setting. (Here, for illustration purposes, we abuse notation and bound the ideal point close to two percentage points (pp); however, it may be higher.) Overall, our method obtains the best compromises between accuracy and computational reduction, estimated by Floating Point Operations (FLOPs). Specifically, our method dominates existing layer-pruning methods (indicated by the symbol *) by a remarkable margin. Compared to state-of-the-art pruning techniques, our method removes more than $75\%$ of FLOPs without hurting accuracy (sometimes improving it). Other methods, however, degrade accuracy when operating at these high FLOP-reduction regimes. Since our method is orthogonal to modern structured filter pruning, we can combine them to achieve even higher computation gains (e.g., Ours + $\ell_1$-norm). The behavior shown in this figure is consistent across other benchmarks and architectures.
Figure 2: Relationship between the number of filters removed (x-axis) and latency speed-up (y-axis) for models obtained from filter and layer pruning. Importantly, such a comparison is possible because when pruning removes layers, it eliminates all filters from that layer. Left and right plots stand for ResNet56 and ResNet110, respectively. Overall, layer pruning notably promotes higher speed-up than filter pruning.
Figure 3: Architecture of a residual-like network. Top. The rationale behind this architecture is that the output of a layer combines the transformation it performs ($f$) with ($\oplus$) the input ($y$) it receives. Because of this, when we disable layer $i$ (its transformation, shown as dashed lines), the output (representation) of layer $i-1$ is propagated to layer $i+1$, which means that the output $y_i$ equals $y_{i-1}$. For the sake of simplicity, we omit the batch normalization and activation layers, which are also transferred in the process of layer removal. Bottom. Process to eliminate a layer from a technical perspective and, thus, obtain practical speed-up gains. After selecting the victim layer (i.e., Layer $i$), we create a new architecture without it and then transfer the weights (red dashed arrows) of the corresponding surviving layers.
Figure 4: Results of pruned models for different adversarial attacks. Green and blue points correspond to accuracy improvement and degradation, respectively. Dotted lines separate the plots into improvement and degradation groups. Top-Right: Results on out-of-distribution samples using CIFAR-10.2 (Lu:2020). Top-Left: Results on adversarial robustness using CIFAR-C (Hendrycks:2019). Bottom-Left: Results on the FGSM adversarial attack. Bottom-Right: Results on ImageNet-C using pruned models from ResNet50.
Figure 5: Performance of our layer-pruning method on a Transformer architecture for human activity recognition based on wearable sensors (tabular data). Each point denotes a pruned model, and the black dashed line indicates the point where the drop in accuracy is zero; thus, points above this line (green) correspond to pruned models with improved accuracy compared to the original, unpruned model.
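
The layer-removal process described in the Figure 3 caption can be sketched with a toy residual stack in plain Python. This is a simplified illustration under stated assumptions: each block is a hypothetical scalar transformation $f(y) = \text{scale} \cdot y + \text{shift}$, the dictionary layout and function names are invented for the example, and batch normalization and activation layers are omitted as in the figure.

```python
def forward(blocks, x):
    """Residual forward pass: y_i = y_{i-1} + f_i(y_{i-1}) for each enabled block."""
    y = x
    for block in blocks:
        if block["enabled"]:
            y = y + block["scale"] * y + block["shift"]
        # A disabled block contributes nothing, so y_i stays equal to y_{i-1}.
    return y


def remove_layer(blocks, victim):
    """Build a shallower stack and copy the weights of the surviving blocks."""
    return [dict(b) for i, b in enumerate(blocks) if i != victim]


# Hypothetical three-block network; the weights are made up.
original = [
    {"enabled": True, "scale": 0.5, "shift": 0.0},
    {"enabled": True, "scale": 1.0, "shift": 2.0},
    {"enabled": True, "scale": 2.0, "shift": -1.0},
]

# Disabling block 1 keeps the depth but skips its transformation.
disabled = [dict(b) for b in original]
disabled[1]["enabled"] = False

# Removing block 1 yields a smaller architecture with transferred weights.
pruned = remove_layer(original, victim=1)

# Both variants compute the same function; only the pruned one is shallower.
assert forward(disabled, 1.0) == forward(pruned, 1.0)
print(len(original), len(pruned))
```

Disabling a block is enough to evaluate a candidate, whereas physically rebuilding the stack without it is what would yield practical speed-up. In a deep learning framework, the copy in `remove_layer` would correspond to loading the surviving layers' parameters into the newly built model.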
