PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference

Tanvir Mahmud; Burhaneddin Yaman; Chun-Hao Liu; Diana Marculescu

TL;DR

PaPr is a method that uses lightweight ConvNets to prune a substantial share of redundant patches with minimal accuracy loss across a variety of deep learning architectures, including ViTs, ConvNets, and hybrid transformers, without any re-training.

Abstract

As deep neural networks evolve from convolutional neural networks (ConvNets) to advanced vision transformers (ViTs), there is an increased need to eliminate redundant data for faster processing without compromising accuracy. Previous methods are often architecture-specific or necessitate re-training, which restricts their applicability when models are updated frequently. To solve this, we first introduce a novel property of lightweight ConvNets: their ability to identify key discriminative patch regions in images, irrespective of the model's final accuracy or size. We demonstrate that fully-connected layers are the primary bottleneck for ConvNet performance, and that suppressing them with simple weight recalibration markedly enhances discriminative patch localization performance. Using this insight, we introduce PaPr, a method for substantially pruning redundant patches with minimal accuracy loss using lightweight ConvNets across a variety of deep learning architectures, including ViTs, ConvNets, and hybrid transformers, without any re-training. Moreover, the simple early-stage one-step patch pruning with PaPr enhances existing patch reduction methods. Through extensive testing on diverse architectures, PaPr achieves significantly higher accuracy than state-of-the-art patch reduction methods with similar FLOP count reduction. More specifically, PaPr reduces about 70% of redundant patches in videos with less than 0.8% drop in accuracy and up to 3.7x FLOPs reduction, which is a 15% greater reduction with 2.5% higher accuracy. Code is released at https://github.com/tanvir-utexas/PaPr.

Paper Structure (30 sections, 4 equations, 20 figures, 9 tables)

Introduction
Related Work
From ConvNets to Vision Transformers
Class Activation Mapping for Explainable Deep Learning
Patch Reduction for Faster Inference
Methodology
Extracting Discriminative Regions with ConvNets
Patch Significance Map
Integrating PSM with Vision Transformers
Integrating PSM with Hierarchical Models
Image Experiments
Experimental Setup
Performance on Various Vision Transformers
Training-free method comparison.
Augreg models:
...and 15 more sections

Figures (20)

Figure 1: (a) Existing patch pruning methods gradually reduce patches across the layers of the model. This requires additional training of mask generators in intermediate layers. (b) The proposed PaPr directly prunes redundant patches early in the network by leveraging pretrained lightweight ConvNets, and directly speeds up off-the-shelf models without re-training.
Figure 2: (a) The baseline ConvNet gradually reduces the feature map to produce $\mathcal{F} = \{f_{k}(x, y) \}_{k=1}^K$, followed by global average pooling and fully connected (FC) layers to predict $y_p$. (b) In PaPr, we operate on $\mathcal{F}$ by suppressing the FC layer. First, we take the pixel-wise mean over the $K$ channels to produce the discriminative region proposal $\mathcal{R}$. A simple upsampling operation then generates the patch significance map (PSM) $\mathcal{P}$ at the target dimension. Finally, the patch mask $\mathcal{M}$ for the top $z\%$ of patches is obtained from $\mathcal{P}$.
Figure 3: (a) In a vanilla ViT, PaPr operates right after the patch extractor module. Hence, all transformer blocks operate only on the most discriminative patches. (b) Hierarchical model blocks consist of a window-based kernel operator (e.g., Conv$k\times k$/local attention), followed by a pixel operator (e.g., linear layer, Conv1x1). The pixel operator consumes more than 60% of total computation. PaPr is used to split off the foreground patches, which alone are processed by the pixel operator. Background patches are zeroed out and finally re-assembled with the foreground output patches.
Figure 4: ImageNet-1k evaluation for varying top-k accuracy targets. The accuracy gain from bigger models largely shrinks as $k$ increases. This suggests that shallower ConvNets have an understanding of object locations and visual properties, despite their lower top-1 accuracy.
Figure 5: Integrating PaPr with ToMe. We use the Augreg pretrained ViT-B-16 architecture as the baseline. We sweep the token merging ratio (r) for different pruning ratios (z). Integration of PaPr achieves Pareto-optimal performance; thus, PaPr can enhance existing patch reduction methods.
...and 15 more figures
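
The steps described in the captions of Figure 2(b) and Figure 3(a) can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the function names (`region_proposal`, `upsample_nearest`, `patch_mask`, `select_tokens`) are hypothetical, nearest-neighbour resizing is an assumption (the caption only says "simple upsampling"), and `z` is taken here as a fraction of patches to keep rather than a percentage. The weight recalibration mentioned in the abstract is not modelled.

```python
from typing import List

Matrix = List[List[float]]


def region_proposal(features: List[Matrix]) -> Matrix:
    """Pixel-wise mean over the K channel maps f_k(x, y), giving R."""
    k = len(features)
    h, w = len(features[0]), len(features[0][0])
    return [[sum(f[y][x] for f in features) / k for x in range(w)]
            for y in range(h)]


def upsample_nearest(r: Matrix, out_h: int, out_w: int) -> Matrix:
    """Resize R to the target patch grid, giving the PSM P.

    Nearest-neighbour resizing is an assumption made for this sketch.
    """
    h, w = len(r), len(r[0])
    return [[r[y * h // out_h][x * w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def patch_mask(psm: Matrix, z: float) -> List[List[bool]]:
    """Mask M that is True for the top z fraction of patches in P."""
    w = len(psm[0])
    scores = [v for row in psm for v in row]
    n_keep = max(1, round(z * len(scores)))
    # Sort patch indices by descending significance; ties keep row-major order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:n_keep])
    return [[(y * w + x) in keep for x in range(w)] for y in range(len(psm))]


def select_tokens(tokens: list, mask: List[List[bool]]) -> list:
    """Keep only the patch tokens (row-major order) whose mask entry is True."""
    flat = [m for row in mask for m in row]
    return [t for t, m in zip(tokens, flat) if m]


if __name__ == "__main__":
    # Two 2x2 channel maps standing in for the ConvNet feature maps F.
    feats = [[[0.0, 2.0], [4.0, 6.0]],
             [[2.0, 4.0], [6.0, 8.0]]]
    psm = upsample_nearest(region_proposal(feats), 4, 4)
    mask = patch_mask(psm, z=0.25)
    # Token ids 0..15 stand in for the 16 patch embeddings of a 4x4 grid.
    print(select_tokens(list(range(16)), mask))  # prints [10, 11, 14, 15]
```

In this toy input the highest activations sit in the bottom-right corner of the feature maps, so the four retained tokens are the bottom-right patches of the 4x4 grid. In a ViT, only these tokens would be passed on to the transformer blocks.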
