Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Yang He, Joey Tianyi Zhou

TL;DR

This work addresses the challenge of pruning hierarchical Vision Transformers, whose local window attention and patch-merge operations complicate traditional magnitude-based pruning. It introduces Data-independent Module-Aware Pruning (DIMAP), which defines modules within Swin Transformers and evaluates weight importance through information distortion, yielding a data-free metric: $\operatorname{Imp}(W_{p_j}) = \frac{(W_{p_j})^2}{\sum_{i \leq p_j} (W_i)^2}$. DIMAP enables one-shot, module-level pruning that respects cross-module contributions and avoids reliance on input data or patch merging dynamics. Experiments on ImageNet-1K with Swin-T/S/B demonstrate that DIMAP achieves substantial reductions in FLOPs and parameters while maintaining or even improving accuracy relative to baselines and state-of-the-art methods. The results suggest that module-aware, distortion-based pruning is a practical and effective strategy for compressing hierarchical ViTs, with potential for downstream tasks and broader ViT variants.
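The metric above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the weights of each module are ranked by descending magnitude and reads the denominator as the sum of squares of all weights ranked at or before the current one. The function names `module_importance` and `one_shot_masks`, the flat-list representation of a module, and the choice to prune the globally lowest scores are all assumptions made for the example.

```python
from typing import Dict, List


def module_importance(weights: List[float]) -> List[float]:
    """Score each weight as w^2 divided by the summed squares of all weights
    in the same module ranked at or before it (descending |w|, an assumption)."""
    # Indices of the weights, largest magnitude first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    scores = [0.0] * len(weights)
    running = 0.0  # sum of squares of the weights visited so far
    for i in order:
        sq = weights[i] ** 2
        running += sq
        scores[i] = sq / running if running > 0 else 0.0
    return scores


def one_shot_masks(modules: Dict[str, List[float]], prune_ratio: float) -> Dict[str, List[bool]]:
    """Return keep-masks after removing the lowest-scored fraction of all
    weights in a single pass (hypothetical selection rule)."""
    scored = [
        (score, name, i)
        for name, ws in modules.items()
        for i, score in enumerate(module_importance(ws))
    ]
    scored.sort(key=lambda t: t[0])
    n_prune = int(len(scored) * prune_ratio)
    masks = {name: [True] * len(ws) for name, ws in modules.items()}
    for _, name, i in scored[:n_prune]:
        masks[name][i] = False
    return masks
```

Because each score is a ratio of squared weights from the same module, multiplying every weight in a module by a constant leaves its scores unchanged, and no input images are needed to compute them. Under this reading, a module with uniformly small weights is not penalised merely for its scale.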

Abstract

Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the "local" attention weights are compared at a "global" level, which may cause some "locally" important weights to be pruned due to their relatively small magnitude "globally". The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that "local" attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning
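The first drawback named in the abstract, "local" weights being compared at a "global" level, can be illustrated with a toy example. The sketch below is plain global magnitude pruning, the baseline being criticised, not DIMAP. The module names and weight values are hypothetical and chosen only so that one module has much smaller weights than the other.

```python
from typing import Dict, List


def global_magnitude_masks(modules: Dict[str, List[float]], prune_ratio: float) -> Dict[str, List[bool]]:
    """Return keep-masks after removing the smallest-|w| fraction of weights,
    ranked across all modules together."""
    ranked = sorted(
        ((abs(w), name, i) for name, ws in modules.items() for i, w in enumerate(ws)),
        key=lambda t: t[0],
    )
    n_prune = int(len(ranked) * prune_ratio)
    masks = {name: [True] * len(ws) for name, ws in modules.items()}
    for _, name, i in ranked[:n_prune]:
        masks[name][i] = False
    return masks


# Hypothetical toy values: an attention-like module with small weights and
# an MLP-like module whose weights are much larger.
toy = {"attn": [0.02, -0.01, 0.03, 0.005], "mlp": [0.8, -0.5, 1.2, 0.1]}
masks = global_magnitude_masks(toy, prune_ratio=0.5)
kept = {name: sum(m) for name, m in masks.items()}
print(kept)  # {'attn': 0, 'mlp': 4}
```

At a 50% pruning ratio, every weight in the small-scale module is removed while the large-scale module is untouched, even though the largest "attn" weight is six times its smallest. This is the failure mode that motivates scoring weights relative to their own module.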

Paper Structure (14 sections, 24 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Parameters and FLOPs for different components of CNNs (VGG-16, Inceptionv3, ResNet-50) and Swin Transformer. "Conv" = Convolutional layers, "FC" = Fully-connected layers, "ATT" = Attention layers. The numbers on top of the histograms are the total parameters and FLOPs of the network. The percentage numbers listed inside histograms are the contribution from the corresponding layers. Note that norm layers and down-sampling layers are not included for better visualization.
  • Figure 2: Weight distributions for different functional layers in the first and second Swin Transformer blocks. The meaning of QKV, PRJ, FC1 and FC2 can be found in Figure 3. The x-axis indicates the values of the weights, and the y-axis denotes the ratios of weights.
  • Figure 3: Categorize network layers to different modules regarding (a) Swin Transformer block; (b) Swin Transformer network. The left figure contains detailed layers of a Swin Transformer block. There are four stages in Swin Transformer. Different colors represent different modules, including the attention-related module (QKV-M and PRJ-M), the multilayer perceptron-related module (MLP-M), and the auxiliary module (AUX-M).
  • Figure 4: Comparison of the theoretical and realistic acceleration. Only the time consumption of the forward procedure is considered.
  • Figure 5: Results of pruning 45% of the parameters from the Swin-S with our module-level weight importance. The x-axis denotes the layer index, and the y-axis indicates the ratio of the remaining parameters. Different colors represent different modules.
  • ...and 1 more figures