PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference

Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu, Diana Marculescu

TL;DR

PaPr is a method that uses lightweight ConvNets to prune a substantial share of redundant patches with minimal accuracy loss. It applies to a variety of deep learning architectures, including ViTs, ConvNets, and hybrid transformers, without any re-training.

Abstract

As deep neural networks evolve from convolutional neural networks (ConvNets) to advanced vision transformers (ViTs), there is an increased need to eliminate redundant data for faster processing without compromising accuracy. Previous methods are often architecture-specific or require re-training, which restricts their applicability when models are updated frequently. To solve this, we first introduce a novel property of lightweight ConvNets: their ability to identify key discriminative patch regions in images, irrespective of the model's final accuracy or size. We demonstrate that fully-connected layers are the primary bottleneck for ConvNet performance, and that suppressing them with simple weight recalibration markedly enhances discriminative patch localization. Using this insight, we introduce PaPr, a method for substantially pruning redundant patches with minimal accuracy loss using lightweight ConvNets across a variety of deep learning architectures, including ViTs, ConvNets, and hybrid transformers, without any re-training. Moreover, the simple early-stage one-step patch pruning with PaPr enhances existing patch reduction methods. In extensive tests on diverse architectures, PaPr achieves significantly higher accuracy than state-of-the-art patch reduction methods at a similar FLOP count reduction. More specifically, PaPr removes about 70% of redundant patches in videos with less than 0.8% drop in accuracy and up to 3.7x FLOPs reduction, which is 15% more reduction with 2.5% higher accuracy. Code is released at https://github.com/tanvir-utexas/PaPr.

Paper Structure (30 sections, 4 equations, 20 figures, 9 tables)

Figures (20)

  • Figure 1: (a) Existing patch pruning methods reduce patches gradually across the model, which requires additional training of mask generators in intermediate layers. (b) The proposed PaPr prunes redundant patches directly, early in the network, by leveraging pretrained lightweight ConvNets, and speeds up off-the-shelf models without re-training.
  • Figure 2: (a) The baseline ConvNet gradually reduces the feature map to produce $\mathcal{F} = \{f_{k}(x, y) \}_{k=1}^K$, followed by global average pooling and fully connected (FC) layers to predict $y_p$. (b) In PaPr, we operate on $\mathcal{F}$ by suppressing the FC layer. First, we take the per-pixel mean over the $K$ channels to produce the discriminative region proposal $\mathcal{R}$. A simple upsampling operation then generates the patch significance map (PSM) $\mathcal{P}$ at the target dimension. Finally, the patch mask $\mathcal{M}$ for the top $z\%$ of patches is obtained from $\mathcal{P}$.
  • Figure 3: (a) In a vanilla ViT, PaPr operates right after the patch extractor module, so every transformer block operates only on the most discriminative patches. (b) Hierarchical model blocks consist of a window-based kernel operator (e.g., Conv$k\times k$/local attention) followed by a pixel operator (e.g., linear layer, Conv1x1). The pixel operator consumes more than 60% of the total computation. PaPr separates out the foreground patches so that only they are processed by the pixel operator. Background patches are zeroed out and finally reassembled with the foreground output patches.
  • Figure 4: ImageNet-1k evaluation for varying top-k accuracy targets. The accuracy gain from bigger models largely shrinks as $k$ increases. This suggests that shallower ConvNets understand object locations and visual properties despite their lower top-1 accuracy.
  • Figure 5: Integrating PaPr with ToMe. We use the AugReg-pretrained ViT-B-16 architecture as the baseline. We sweep the token merging ratio (r) for different pruning ratios (z). Integrating PaPr achieves Pareto-optimal performance, so PaPr can enhance existing patch reduction methods.
  • ...and 15 more figures
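
The pipeline described in the Figure 2 caption (channel mean, upsampling, top-$z\%$ mask) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' released implementation: the function names are hypothetical, nearest-neighbour upsampling is an assumption (the caption only says "simple upsampling"), and `keep_ratio` stands in for $z\%$ expressed as a fraction.

```python
from typing import List

Grid = List[List[float]]


def region_proposal(features: List[Grid]) -> Grid:
    """R: per-pixel mean over the K channel maps f_k(x, y)."""
    k = len(features)
    h, w = len(features[0]), len(features[0][0])
    return [[sum(f[y][x] for f in features) / k for x in range(w)]
            for y in range(h)]


def upsample_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """P: resize R to the target patch grid (nearest neighbour is assumed)."""
    h, w = len(grid), len(grid[0])
    return [[grid[y * h // out_h][x * w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def patch_mask(psm: Grid, keep_ratio: float) -> List[List[bool]]:
    """M: True for the top `keep_ratio` fraction of patches by PSM score."""
    h, w = len(psm), len(psm[0])
    n_keep = max(1, round(keep_ratio * h * w))
    # Rank flattened patch indices by score, highest first.
    order = sorted(range(h * w), key=lambda i: psm[i // w][i % w], reverse=True)
    kept = set(order[:n_keep])
    return [[(y * w + x) in kept for x in range(w)] for y in range(h)]


# Toy input: K = 2 channels on a 2x2 feature map, target grid of 4x4 patches.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[3.0, 0.0], [0.0, 2.0]],
]
R = region_proposal(features)
P = upsample_nearest(R, 4, 4)
M = patch_mask(P, keep_ratio=0.25)
```

In a ViT setting such as Figure 3(a), the `True` entries of `M` would select which patch tokens are passed on to the transformer blocks; how ties between equally scored patches are broken is left to the sort order here and is not specified by the source.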