The Need for Speed: Pruning Transformers with One Recipe

Samir Khaki, Konstantinos N. Plataniotis

TL;DR

OPTIN is one of the first efficient one-shot frameworks for compressing transformer architectures without re-training that generalizes well across task domains, in particular natural language and image-related tasks.

Abstract

We introduce the $\textbf{O}$ne-shot $\textbf{P}$runing $\textbf{T}$echnique for $\textbf{I}$nterchangeable $\textbf{N}$etworks ($\textbf{OPTIN}$) framework as a tool to increase the efficiency of pre-trained transformer architectures $\textit{without requiring re-training}$. Recent works have explored improving transformer efficiency, but they often incur computationally expensive re-training procedures or depend on architecture-specific characteristics, which impedes practical wide-scale adoption. To address these shortcomings, the OPTIN framework leverages intermediate feature distillation, capturing the long-range dependencies of model parameters (coined $\textit{trajectory}$), to produce state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks $\textit{without re-training}$. Given a FLOP constraint, the OPTIN framework compresses the network while maintaining competitive accuracy and improving throughput. In particular, we show a $\leq 2\%$ accuracy degradation from NLP baselines and a $0.5\%$ improvement over state-of-the-art methods on image classification at competitive FLOPs reductions. We further demonstrate generalization across tasks and architectures, with comparable performance using Mask2Former for semantic segmentation and on CNN-style networks. OPTIN is one of the first efficient one-shot frameworks for compressing transformer architectures that generalizes well across task domains, in particular natural language and image-related tasks, without $\textit{re-training}$.

Paper Structure (20 sections, 6 equations, 7 figures, 8 tables, 1 algorithm)


Figures (7)

  • Figure 1: Computation of the OPTIN framework's trajectory metric on weight $\theta_{i,j}$. By applying a mask to weight $\theta_{i,j}$ in $\texttt{Layer}_i$ and executing a forward pass, the OPTIN framework measures the effect on future layer embeddings (the trajectory) as an indicator of weight importance. $\mathcal{L}_{MD}$ is the manifold distillation loss computed between layer embeddings at each transformer block, while $\mathcal{L}_{KD}$ is the KL-divergence computed between the original logits and those produced with the masked weight. The combined losses are further detailed under the Weight Importance heading in Sec. \ref{sec:sec_method_traj}.
  • Figure 2: Natural Language Benchmarks. Comparing OPTIN performance on the GLUE benchmark \cite{wang2018glue} (refer to \ref{appendix:addLangExp} for additional results). The relative FLOP constraint is set to 60% for a fair comparison.
  • Figure 2: Natural Language FLOPs vs Accuracy. We directly benchmark the OPTIN framework against leading state-of-the-art methods in natural language model compression. Because each work reports different baselines, we plot the relative performance drop for each method. On the right, we compare the performance gap against latency, showing that with an average drop of $\leq 1.75\%$ we can achieve a $1.25\times$ speedup in throughput purely from static model size reduction.
  • Figure 3: Pruning ImageNet-1K. Benchmarking the performance of OPTIN using DeiT-Tiny/Small. Methods marked $^{\dagger}$ are reproduced in \cite{song2022cpvit} without re-training. $^{\dagger \dagger}$ The DeiT-S result from \cite{chuanyang2022savit} is excluded because it outperforms the available baseline. The OPTIN framework runs without re-training and produces both the $\beta$ and $\tau$ configurations.
  • Figure 4: Transfer Learning on the CIFAR Dataset. Benchmarking the performance of OPTIN on the CIFAR-10 dataset. Models were pre-trained on ImageNet-1K, pruned with the OPTIN framework's $\tau$ configuration, and transfer-learned at a more aggressive pruning rate onto the CIFAR-10 (C-10) dataset.
  • ...and 2 more figures
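
The mask-and-measure idea in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a hypothetical toy model (a stack of tanh layers with a linear head), masks one output unit at a time as a stand-in for a prunable weight $\theta_{i,j}$, and uses a simplified sample-similarity (Gram matrix) distance in place of the paper's manifold distillation loss $\mathcal{L}_{MD}$. The names `forward`, `gram`, `trajectory_score`, and the weighting `lam` are illustrative assumptions.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(layers, head, x, mask=None):
    """Toy stack of tanh layers plus a linear head (hypothetical model).

    mask=(layer_idx, unit_idx) zeroes one output unit of one layer,
    standing in for masking a single prunable weight.
    Returns the per-layer embeddings and the logits.
    """
    embeddings, h = [], x
    for i, w in enumerate(layers):
        h = np.tanh(h @ w)
        if mask is not None and mask[0] == i:
            h[:, mask[1]] = 0.0
        embeddings.append(h)
    return embeddings, h @ head


def gram(h):
    """Cosine-similarity matrix between samples (simplified manifold view)."""
    h = h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-8)
    return h @ h.T


def trajectory_score(layers, head, x, layer_idx, unit_idx, lam=1.0):
    """Importance of one unit: how far masking it moves later layers and logits."""
    ref_emb, ref_logits = forward(layers, head, x)
    cut_emb, cut_logits = forward(layers, head, x, mask=(layer_idx, unit_idx))
    # Feature term: compare embeddings of the layers after the masked one.
    md = sum(
        np.mean((gram(a) - gram(b)) ** 2)
        for a, b in zip(ref_emb[layer_idx + 1:], cut_emb[layer_idx + 1:])
    )
    # Logit term: KL divergence from the original to the masked prediction.
    p, q = softmax(ref_logits), softmax(cut_logits)
    kd = np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1))
    return md + lam * kd


# Score every unit of a random toy model and list the least important ones.
rng = np.random.default_rng(0)
layers = [0.5 * rng.normal(size=(8, 8)) for _ in range(3)]
head = rng.normal(size=(8, 4))
x = rng.normal(size=(16, 8))
scores = {
    (i, j): trajectory_score(layers, head, x, i, j)
    for i in range(3)
    for j in range(8)
}
ranked = sorted(scores, key=scores.get)
print("least important (layer, unit):", ranked[:3])
```

Units with the lowest score change the downstream embeddings and logits the least, so in this sketch they would be the first candidates for removal under a FLOP budget. How the paper combines and weights the two losses is described under its Weight Importance heading and is not reproduced here.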