SNP: Structured Neuron-level Pruning to Preserve Attention Scores

Kyunghwan Shim, Jaewoong Yun, Shinkook Choi

TL;DR

Structured Neuron-level Pruning (SNP) prunes graphically connected query and key layers having the least informative attention scores while preserving the overall attention scores, effectively compressing and accelerating Transformer-based models for both edge devices and server processors.

Abstract

Multi-head self-attention (MSA) is a key component of Vision Transformers (ViTs), which have achieved great success in various vision tasks. However, their high computational cost and memory footprint hinder their deployment on resource-constrained devices. Conventional pruning approaches can only compress and accelerate the MSA module using head pruning, although the head is not an atomic unit. To address this issue, we propose a novel graph-aware neuron-level pruning method, Structured Neuron-level Pruning (SNP). SNP prunes neurons with less informative attention scores and eliminates redundancy among heads. Specifically, it prunes graphically connected query and key layers having the least informative attention scores while preserving the overall attention scores. Value layers, which can be pruned independently, are pruned to eliminate inter-head redundancy. Our proposed method effectively compresses and accelerates Transformer-based models for both edge devices and server processors. For instance, DeiT-Small with SNP runs 3.1$\times$ faster than the original model, and is 21.94\% faster than DeiT-Tiny while achieving 1.12\% higher performance. Additionally, SNP combines successfully with conventional head or block pruning approaches. SNP with head pruning can compress DeiT-Base by 80\% in parameters and computational cost, achieving 3.85$\times$ faster inference on an RTX3090 and 4.93$\times$ on a Jetson Nano.
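The idea of pruning "graphically connected" query and key layers can be illustrated with a small NumPy sketch. Because the pre-softmax scores are $QK^\top = \sum_j q_j k_j^\top$ over the head dimension $j$, removing dimension $j$ from the query projection requires removing the same $j$ from the key projection. The importance score used here (the Frobenius norm of each rank-1 term $q_j k_j^\top$) is an assumption chosen for illustration and is not claimed to be the paper's criterion; the function name `prune_qk_pair` and all shapes are likewise hypothetical.

```python
import numpy as np

def prune_qk_pair(w_q, w_k, x, keep):
    """Drop the same head-dimension indices from a query/key projection pair.

    w_q, w_k : (d_model, d_head) projection weights of one attention head
    x        : (n_tokens, d_model) calibration inputs
    keep     : number of head dimensions to retain
    """
    d_head = w_q.shape[1]
    assert w_k.shape == w_q.shape and 1 <= keep <= d_head

    q = x @ w_q  # (n_tokens, d_head)
    k = x @ w_k  # (n_tokens, d_head)

    # Q K^T is a sum of rank-1 terms outer(q[:, j], k[:, j]).
    # The Frobenius norm of each term equals ||q[:, j]|| * ||k[:, j]||.
    # Illustrative proxy only, not the criterion from the paper.
    importance = np.linalg.norm(q, axis=0) * np.linalg.norm(k, axis=0)

    # Keep the highest-scoring dimensions, in their original order.
    kept = np.sort(np.argsort(importance)[-keep:])

    # The same indices are removed from both layers, so the pruned
    # score matrix still has shape (n_tokens, n_tokens).
    return w_q[:, kept], w_k[:, kept], kept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 32))
    w_q = rng.normal(size=(32, 8))
    w_k = rng.normal(size=(32, 8))

    w_q_p, w_k_p, kept = prune_qk_pair(w_q, w_k, x, keep=4)

    full = (x @ w_q) @ (x @ w_k).T
    pruned = (x @ w_q_p) @ (x @ w_k_p).T
    rel_err = np.linalg.norm(full - pruned) / np.linalg.norm(full)
    print("kept dims:", kept, "relative score error:", rel_err)
```

The sketch compares raw pre-softmax scores and omits the $1/\sqrt{d}$ scaling, the softmax, and fine-tuning. Value-layer pruning is not shown, since the abstract states that value layers can be pruned independently.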

Paper Structure (32 sections, 7 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Comparison of model size, speed, and performance. ImageNet-$1\mathrm{K}$ classification results. Latency is profiled on a Raspberry Pi 4B. The connected lines represent the compressed models paired with the original model. The size of each circle indicates the number of parameters in the respective model. The number adjacent to each compressed model indicates its compressed GFLOPs.
  • Figure 2: Proposed SNP methods on each prunable component of the Transformer block. (a) SNP pruning criteria of query and key layers to preserve attention scores. (b) Prunable components of the Transformer block. (c) SNP pruning criteria of value and other layers, including FFN and patch embedding. (d) Conventional zeroing out in the matrix multiplication operator. (e) Conventional zeroing out and graph-aware pruning in the residual connection.
  • Figure 3: Attention maps with varying pruning criteria and compression ratios. All query and key layers are locally pruned based on the specified pruning ratio without fine-tuning. The importance scores of $\ell_2$-norm and GM on query and key layers are combined by filter index and pruned simultaneously. “Reverse” represents the reverse order of SNP.
  • Figure 4: Attention maps from the original, compressed, and fine-tuned DeiT-Tiny with SNP. The attention maps in the first row are visualized using attention rollout [attention_rollout]. Each red box contains three attention maps from each head of the MSA module, ordered accordingly.
  • Figure 5: Top-1 accuracy of compressed DeiT-Tiny on ImageNet using several pruning criteria without fine-tuning. Query and key layers are locally pruned using various pruning criteria: SNP, GM, $\ell_2$-norm, reverse order of SNP (“Reverse”), and original DeiT-Tiny (“Baseline”).
  • ...and 3 more figures