Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers

Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

TL;DR

This work extends prior attribution-based pruning by explicitly optimizing the hyperparameters of attribution methods for the task of pruning, and by including transformer-based networks in the analysis. The experiments indicate that transformers have a higher degree of over-parameterization than convolutional neural networks.

Abstract

To solve ever more complex problems, Deep Neural Networks are scaled to billions of parameters, leading to huge computational costs. An effective approach to reduce computational requirements and increase efficiency is to prune unnecessary components of these often over-parameterized networks. Previous work has shown that attribution methods from the field of eXplainable AI serve as effective means to extract and prune the least relevant network components in a few-shot fashion. We extend the current state by proposing to explicitly optimize hyperparameters of attribution methods for the task of pruning, and further include transformer-based networks in our analysis. Our approach yields higher model compression rates of large transformer- and convolutional architectures (VGG, ResNet, ViT) compared to previous works, while still attaining high performance on ImageNet classification tasks. Here, our experiments indicate that transformers have a higher degree of over-parameterization compared to convolutional neural networks. Code is available at https://github.com/erfanhatefi/Pruning-by-eXplaining-in-PyTorch.

Paper Structure (40 sections, 21 equations, 19 figures, 6 tables)

Figures (19)

  • Figure 1: We propose a pruning framework based on optimizing attribution methods from the field of eXplainable AI. Compared to random pruning, pruning the least relevant structures first ("relevant" according to an attribution method of choice, and indicated by red color) has been shown to result in an improved performance-sparsity tradeoff (simplified illustration depicted). By optimizing attribution methods specifically for pruning, we can improve this tradeoff even further.
  • Figure 2: Attribution-based pruning workflow: First, the relevant model structures are identified by explaining a set of reference samples. The attribution method of choice (here LRP) highlights the components and paths in the network that contribute positively and negatively to the decision-making. Positive and negative relevances are indicated by red and blue color respectively, and white components indicate zero or low relevance. Removing the structures that receive the least relevance results in a sparser subnetwork, which performs significantly better than one obtained by random pruning. Notably, relevances can be computed w.r.t. a subset of output classes (e.g., "corgi" only), resulting in a subnetwork specifically designed to perform the restricted task. Credit: Nataba/iStock.
  • Figure 3: Investigating over-parameterization through attribution-based pruning (LRP-opt in blue, Integrated Gradients in orange) and random pruning (green). We compare pruning of all models w.r.t. different task difficulties, i.e., differentiating between 1000 (dotted line), 100 (dashed line) or three ImageNet classes (solid line). High performance at high sparsification rates indicates over-parameterization, i.e., many network components are not important for the task. Compared to the ResNet-18 and VGG-16-BN (top), the ViT-B-16 transformer shows a higher degree of over-parameterization (bottom). is illustrated (shaded area) in the current and all other figures.
  • Figure 4: Pruning models pre-trained on ImageNet (task simplified to detecting three classes), using ten reference samples per class. Results show a better sparsification-performance tradeoff for our optimized composite compared to a heuristic (faithful) composite, Yeom et al. \cite{yeom2021pruning} (details for each in \ref{app:tab:cnn_composites}, \ref{app:sec:faithful_lrp}, and \ref{eq:lrp-zplus}), and random pruning.
  • Figure 5: Attribution-based pruning using different numbers of reference samples (per class) to estimate the importance of attention heads (left) or neurons in linear layers (right) of the ViT-B-16. This experiment has been conducted for 20 different random seeds. For the propagation of relevance, LRP-$\epsilon$ has been set as our parameter for all layers (\ref{sec:methods:lrp_hyperparameters}), and w.r.t. the attribution of softmax operations (\ref{app:sec:lrp_transformers}).
  • ...and 14 more figures
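
The workflow described in the Figure 2 caption (explain a set of reference samples, then remove the structures that receive the least relevance) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `aggregate_relevance` and `select_components_to_prune`, the component names, and the relevance values are all hypothetical, and relevance scores are assumed to be already computed by some attribution method. Averaging over reference samples and ranking by signed relevance are also assumptions; ranking by magnitude would be another plausible choice.

```python
from typing import Dict, List, Sequence


def aggregate_relevance(per_sample: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average each component's relevance over the reference samples (assumed aggregation)."""
    if not per_sample:
        raise ValueError("at least one reference sample is required")
    totals: Dict[str, float] = {}
    for scores in per_sample:
        for name, value in scores.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / len(per_sample) for name, total in totals.items()}


def select_components_to_prune(relevance: Dict[str, float], sparsity: float) -> List[str]:
    """Return the lowest-relevance components for a given pruning rate in [0, 1]."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = int(sparsity * len(relevance))
    # Rank by signed relevance, lowest first (an assumption of this sketch).
    ranked = sorted(relevance, key=lambda name: relevance[name])
    return ranked[:n_prune]


# Hypothetical relevance scores for four components on two reference samples.
reference_scores = [
    {"head_0": 0.50, "head_1": 0.02, "head_2": -0.10, "head_3": 0.30},
    {"head_0": 0.40, "head_1": 0.04, "head_2": -0.20, "head_3": 0.10},
]
relevance = aggregate_relevance(reference_scores)
pruned = select_components_to_prune(relevance, sparsity=0.5)
print(pruned)
```

In a real network, the selected names would correspond to structures such as attention heads or neurons whose outputs are masked or removed. Restricting the reference samples to a subset of classes would then yield a subnetwork tailored to that restricted task, as the caption notes.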