Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression

Hancheng Ye; Chong Yu; Peng Ye; Renqiu Xia; Yansong Tang; Jiwen Lu; Tao Chen; Bo Zhang

TL;DR

Vision Transformer compression commonly relies on a two-stage process that estimates importance and sparsity separately, which causes a gap between the two score distributions and makes the search costly. OFB introduces a one-stage approach that jointly learns a bi-mask score $m_{ij}(t)=\lambda(t) \mathcal{S}_{ij} + (1-\lambda(t)) \mathcal{V}_{ij}(\alpha)$ to entangle importance $\mathcal{S}$ and differentiable sparsity $\mathcal{V}$, guided by an Adaptive One-hot Loss and reinforced by Progressive Masked Image Modeling ($L_{rec}$ with a gradually increasing mask ratio $\gamma$). The method delivers superior compression performance on DeiT and Swin models with markedly reduced search time (about one GPU-day) and strong transferability to downstream tasks. Overall, OFB provides a scalable, end-to-end framework that achieves high sparsity while preserving accuracy, making efficient ViTs practical to deploy in resource-constrained vision applications.
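The bi-mask score is a convex combination of the importance and sparsity scores, which the following minimal Python sketch illustrates. The function name `bi_mask_score`, the list-based inputs, and the example values are hypothetical. How $\lambda(t)$ is scheduled and how $\mathcal{V}_{ij}(\alpha)$ is computed are not specified in this summary, so both are passed in as plain numbers rather than derived.

```python
def bi_mask_score(importance, sparsity, lam_t):
    """Blend per-unit scores as m = lam_t * S + (1 - lam_t) * V.

    importance: list of importance scores S for the units of one submodule.
    sparsity:   list of sparsity scores V(alpha) for the same units.
    lam_t:      value of lambda(t) at the current step, assumed to lie in [0, 1].
    """
    if not 0.0 <= lam_t <= 1.0:
        raise ValueError("lam_t must lie in [0, 1]")
    if len(importance) != len(sparsity):
        raise ValueError("importance and sparsity must have the same length")
    return [lam_t * s + (1.0 - lam_t) * v for s, v in zip(importance, sparsity)]


# Illustrative values only; they are not taken from the paper.
S = [0.2, 0.8]
V = [1.0, 0.0]

only_importance = bi_mask_score(S, V, 1.0)  # lambda = 1: the mask equals S
only_sparsity = bi_mask_score(S, V, 0.0)    # lambda = 0: the mask equals V
blended = bi_mask_score(S, V, 0.5)          # equal weight on both scores
```

At $\lambda(t)=1$ the mask reduces to the importance score, and at $\lambda(t)=0$ it reduces to the sparsity score; intermediate values interpolate between the two.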

Abstract

Recent Vision Transformer Compression (VTC) works mainly follow a two-stage scheme, where the importance score of each model unit is first evaluated or preset in each submodule, followed by the sparsity score evaluation according to the target sparsity constraint. Such a separate evaluation process induces a gap between the importance and sparsity score distributions, thus causing high search costs for VTC. In this work, for the first time, we investigate how to integrate the evaluations of importance and sparsity scores into a single stage, searching for the optimal subnets in an efficient manner. Specifically, we present Once for Both (OFB), a cost-efficient approach for VTC that evaluates importance and sparsity scores simultaneously. First, a bi-mask scheme is developed by entangling the importance score and the differentiable sparsity score to jointly determine the pruning potential (prunability) of each unit. This bi-mask search strategy is then combined with a proposed adaptive one-hot loss to realize a progressive and efficient search for the most important subnet. Finally, Progressive Masked Image Modeling (PMIM) is proposed to regularize the feature space, which may be degraded by dimension reduction, so that it remains representative during the search process. Extensive experiments demonstrate that OFB achieves superior compression performance over state-of-the-art searching-based and pruning-based methods under various Vision Transformer architectures, while significantly improving search efficiency, e.g., requiring one GPU-day of search for the compression of DeiT-S on ImageNet-1K.
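The "progressive" part of PMIM refers to a mask ratio $\gamma$ that grows during the search. The sketch below shows one way such a schedule could look. The linear ramp, the starting value of zero, the function names, and the values `gamma_max=0.6` and 196 patches are all assumptions made for illustration; the abstract only states that the ratio increases gradually.

```python
import random


def progressive_mask_ratio(epoch, total_epochs, gamma_max):
    """Hypothetical linear ramp: 0 at epoch 0, gamma_max at the last epoch."""
    if total_epochs <= 1:
        return gamma_max
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return gamma_max * frac


def random_patch_mask(num_patches, gamma, seed=0):
    """Return a list of booleans (True = masked patch).

    Exactly round(gamma * num_patches) patches are masked, chosen uniformly
    at random with a seeded generator so the result is reproducible.
    """
    k = round(gamma * num_patches)
    masked = set(random.Random(seed).sample(range(num_patches), k))
    return [i in masked for i in range(num_patches)]


# Illustrative run: five search epochs, ramping the ratio up to 0.6.
ratios = [progressive_mask_ratio(e, 5, 0.6) for e in range(5)]
mask = random_patch_mask(196, ratios[-1])
```

Starting with few masked patches keeps the reconstruction task easy while the subnet is still changing quickly, and the task becomes harder as the search settles; whether the paper uses this exact shape is not stated here.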

Paper Structure (29 sections, 14 equations, 11 figures, 9 tables, 1 algorithm)

Introduction
Related Works
Transformer Architecture Search.
Vision Transformer Pruning.
Masked Image Modeling.
The Proposed Approach
Problem Formulation
Search Space.
Bi-mask Weight-sharing Strategy
Adaptive One-hot Loss
Progressive MIM
Experiments
Results on ImageNet
Transfer Learning Results
Ablation Study
...and 14 more sections

Figures (11)

Figure 1: The relationship between importance and sparsity score distributions in different search paradigms. (a) Importance scores are fixed during sparsity search, and sparsity scores are related to importance scores. (b) Importance scores of one submodule are also related to the sparsity of other to-prune submodules. (c) Importance and sparsity scores are entangled and simultaneously optimized, thus correlated at forward and backward phases of searching.
Figure 2: Different paradigms for VTC. (a): SPOS-based TAS implicitly encodes the piecewise-decreasing importance scores for units due to the uniform sampling in pre-training; (b): The threshold-based TP explicitly evaluates the importance scores for units and sets a global threshold to perform pruning; (c): DARTS learns the importance distribution in a differentiable manner and selects the subnet of the highest architecture score; (d): OFB proposes the bi-mask score that entangles importance and sparsity scores together, to perform the search process in a single stage.
Figure 3: The overview of OFB search framework, including the design of search space, search scheme, and regularization scheme. (a) For the search space, we consider four types of submodules. (b) For the search scheme, we simultaneously learn the importance score $\mathcal{S}$ and the sparsity score $\mathcal{V}$ based on the bi-mask weight-sharing strategy, under the guidance of an adaptive one-hot loss. (c) The PMIM technique is developed to augment the pruned feature space, which introduces a progressive masking strategy to MIM for better regularization.
Figure 4: Performance of the searched DeiT models with/without retraining by employing/not employing PMIM during searching.
Figure 5: Visualization of bi-mask search process. Each line/surface is a descendingly-ordered distribution learned after one-epoch search, with the lighter color denoting a later learned distribution.
...and 6 more figures
