Dense Vision Transformer Compression with Few Samples

Hanxiao Zhang; Yifan Zhou; Guo-Hua Wang; Jianxin Wu

TL;DR

DC-ViT is a dense few-shot compression framework for Vision Transformers. In each selected block it removes the attention module while reusing and resizing the MLP, which makes a dense range of MACs reductions reachable with very little data. It works in three stages: determine the compressed structure, including the number of blocks to compress, from the target MACs reduction; select the blocks using a synthetic metric set generated from the pre-trained model; and prune progressively, finetuning on unlabeled data guided by feature mimicking. DC-ViT is more accurate than state-of-the-art few-shot baselines (notably PRACTISE) at comparable MACs reductions across ViT variants, extends to CNNs, and runs with lower latency. Using synthetic metric data to predict recoverability, together with progressive, partial finetuning and MLP weight reuse, yields robust performance and good transferability to downstream tasks, which suggests practical value for deploying large ViTs on resource-constrained devices.
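The first stage described above can be made concrete with a rough MACs calculation. The sketch below is an illustration, not the paper's procedure: the MACs formulas are the usual per-block approximations (LayerNorm, softmax, biases, patch embedding and the classifier head are ignored), the default shape is ViT-Base at 224x224 input (12 blocks, 197 tokens, width 768, MLP hidden size 3072), and the planning rule (use the smallest number of blocks that could reach the target, then shrink their MLPs evenly) is an assumption. All function names are hypothetical.

```python
import math


def attention_macs(tokens, dim):
    # Q, K, V and output projections: 4 * tokens * dim^2
    # attention scores and weighted sum of values: 2 * tokens^2 * dim
    return 4 * tokens * dim * dim + 2 * tokens * tokens * dim


def mlp_macs(tokens, dim, hidden):
    # two linear layers: dim -> hidden and hidden -> dim
    return 2 * tokens * dim * hidden


def removed_macs(num_blocks, kept_hidden, tokens=197, dim=768, hidden=3072):
    # MACs saved when num_blocks blocks lose their attention module and
    # keep only kept_hidden of their MLP hidden units
    per_block = attention_macs(tokens, dim) + mlp_macs(tokens, dim, hidden - kept_hidden)
    return num_blocks * per_block


def plan_compression(target_ratio, depth=12, tokens=197, dim=768, hidden=3072):
    """Return (num_blocks, kept_hidden) removing at least target_ratio of block MACs.

    Assumed rule: pick the smallest number of blocks that could reach the
    target if dropped entirely, remove attention in each of them, and shrink
    their MLPs evenly to cover the rest of the budget.
    """
    if not 0.0 < target_ratio <= 1.0:
        raise ValueError("target_ratio must be in (0, 1]")
    attn = attention_macs(tokens, dim)
    mlp = mlp_macs(tokens, dim, hidden)
    to_remove = target_ratio * depth * (attn + mlp)
    per_unit = 2 * tokens * dim  # MACs contributed by one MLP hidden unit
    for k in range(1, depth + 1):
        if k * (attn + mlp) >= to_remove:
            # If removing attention alone already overshoots, keep the full MLP.
            mlp_part = max(0.0, to_remove / k - attn)
            removed_units = min(hidden, math.ceil(mlp_part / per_unit))
            return k, hidden - removed_units
    raise ValueError("target cannot be reached")


if __name__ == "__main__":
    for ratio in (0.10, 0.15, 0.20, 0.25):
        print(ratio, plan_compression(ratio))
```

Because the kept MLP width can take any integer value, many different MACs targets map to distinct structures under this rule, which is one way to picture what "dense" compression means here.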

Abstract

Few-shot model compression aims to compress a large model into a more compact one using only a tiny training set (possibly without labels). Block-level pruning has recently emerged as a leading technique for achieving high accuracy and low latency in few-shot CNN compression. However, few-shot compression for Vision Transformers (ViT) remains largely unexplored and presents a new challenge. In particular, traditional few-shot CNN methods suffer from sparse compression: they can produce only a few compressed models of different sizes. This paper proposes a novel framework for few-shot ViT compression named DC-ViT. Instead of dropping an entire block, DC-ViT selectively eliminates the attention module while retaining and reusing portions of the MLP module. DC-ViT enables dense compression, producing numerous compressed models that densely populate the range of model complexity. DC-ViT outperforms state-of-the-art few-shot compression methods by a significant margin of 10 percentage points, with lower latency, when compressing ViT and its variants.
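To illustrate what "eliminating the attention module while retaining and reusing portions of the MLP" could look like at the level of weights, here is a minimal pure-Python sketch. It rests on assumptions: the reused portion is taken to be the first `keep` hidden units, LayerNorm is omitted for brevity, and the names `slice_mlp` and `compressed_block` are hypothetical rather than taken from the paper's code.

```python
import math


def gelu(x):
    # exact GELU via the Gaussian error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def slice_mlp(w1, b1, w2, keep):
    # w1: hidden x dim (one row per hidden unit), b1: hidden, w2: dim x hidden.
    # Assumption: reuse the first `keep` hidden units of the original MLP.
    return w1[:keep], b1[:keep], [row[:keep] for row in w2]


def mlp(x, w1, b1, w2, b2):
    hidden = [gelu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def compressed_block(x, w1, b1, w2, b2, keep):
    # The attention branch is gone; only the residual MLP branch remains,
    # computed with the sliced (reused) weights.
    w1k, b1k, w2k = slice_mlp(w1, b1, w2, keep)
    out = mlp(x, w1k, b1k, w2k, b2)
    return [xi + oi for xi, oi in zip(x, out)]


if __name__ == "__main__":
    w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 hidden units, width 2
    b1 = [0.0, 0.0, 0.0]
    w2 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
    b2 = [0.0, 0.0]
    print(compressed_block([1.0, -1.0], w1, b1, w2, b2, keep=2))
```

With `keep=0` and a zero output bias the block reduces to the identity, which corresponds to dropping the block entirely; intermediate values of `keep` give the in-between model sizes.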

Paper Structure (21 sections, 6 equations, 8 figures, 9 tables, 1 algorithm)

Introduction
Related Works
Method
Determining the Network Structure
Block Selection
Block-wise Trial
Generate a Synthetic Metric Set
Choose the Blocks to Compress
Progressive Pruning and Finetuning
Experiments
Experimental Settings
Baseline Pruning Methods
DC-ViT Performance
Results on ViT-Base
Results on CNN Architectures
...and 6 more sections

Figures (8)

Figure 1: Top-1 accuracy vs. MACs (G) on ImageNet. Our DC-ViT was compressed with few training images (50, 100, 500, or 1000), and PRACTISE was compressed with 500 images. The other methods were not few-shot: they used 1.28 million labeled training images during compression. Our DC-ViT uses fewer than 0.1% of their training images, without labels, yet achieves only slightly lower or even higher accuracy.
Figure 2: The range of compression options attainable by our DC-ViT and PRACTISE [wang2023practical] by varying the number of compressed blocks. The black vertical lines, highlighted by the red arrows, mark the specific compression rates achieved by PRACTISE. DC-ViT can output models densely across a wide range of compression rates for different numbers of blocks. The gradient color bar indicates the accuracy of our method at different compression rates.
Figure 3: The DC-ViT framework. (a) Determining the network structure to achieve the target MACs reduction. (b) Generating a synthetic metric set from Gaussian noise. (c) Using the synthetic metric set to select the blocks with the highest recoverability. (d) Progressive pruning and finetuning. The original model is at the top, and the pruned model is at the bottom. We drop the whole attention module but reuse part of the MLP. Finetuning uses the MSE loss for feature mimicking and updates only the front part of the network, up to the block that follows the compressed one.
Figure 4: The top-1 error of 12 candidate models (obtained by compressing the 12 ViT blocks one at a time), compared with different criteria for block selection. The top-1 error is calculated using the original test set. The metric loss is calculated using the synthetic metric set $\mathcal{S}$. The training loss is calculated using the tiny training set $\mathcal{D}_{\mathcal{T}}$. All three are scaled to the same range for comparison.
Figure 5: The synthetic metric set generated by ViT-Tiny.
...and 3 more figures
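The block-selection and partial-finetuning steps described in the captions of Figures 3 and 4 can be sketched in a few lines. This is a hedged illustration under stated assumptions: min-max scaling is assumed as the way different criteria are brought to the same range, "highest recoverability" is read as the lowest metric loss after a block-wise trial, and the set of updated blocks is read as every block from the start of the network through the one after the compressed block. The function names are hypothetical.

```python
def min_max_scale(values):
    # Map a list of scores to [0, 1] so different criteria share one axis.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_blocks(metric_losses):
    # Block indices ordered from most to least recoverable, assuming a
    # lower metric loss means the compressed block is easier to recover.
    return sorted(range(len(metric_losses)), key=lambda i: metric_losses[i])


def trainable_blocks(compressed_index, depth):
    # Blocks updated during finetuning: the front of the network up to and
    # including the block right after the compressed one (clipped at depth).
    return list(range(min(compressed_index + 2, depth)))


def mse(student_features, teacher_features):
    # Feature-mimicking loss between pruned and original model outputs.
    n = len(student_features)
    return sum((s - t) ** 2 for s, t in zip(student_features, teacher_features)) / n


if __name__ == "__main__":
    losses = [0.3, 0.1, 0.2]          # made-up metric losses for 3 blocks
    print(min_max_scale(losses))
    print(rank_blocks(losses))
    print(trainable_blocks(3, 12))
    print(mse([1.0, 2.0], [1.0, 4.0]))
```

In a real pipeline the loss values would come from evaluating each candidate model on the synthetic metric set, and the list returned by `trainable_blocks` would decide which parameters receive gradients.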
