
(PASS) Visual Prompt Locates Good Structure Sparsity through a Recurrent HyperNetwork

Tianjin Huang, Fang Meng, Li Shen, Fan Liu, Yulong Pei, Mykola Pechenizkiy, Shiwei Liu, Tianlong Chen

TL;DR

This work tackles the efficiency challenge of large vision models by proposing PASS, a data-centric channel pruning framework that uses visual prompts to guide the estimation of layer-wise channel importance. A recurrent hypernetwork, implemented with an LSTM, generates a mask $M^{(i)}$ for each layer $i$, conditioned on weight statistics $W^{(i)}$ and visual prompts $V$; the masks produce sparse weights via $\widehat{W}^{(i)} = M^{(i-1)} \otimes W^{(i)} \otimes M^{(i)}$. The visual-prompt encoder $g_{\omega}$ initializes the LSTM, channel scores are obtained through a linear mapper with straight-through discretization, and global pruning determines layer-wise sparsity. The parameters $\theta$, $\omega$, and $V$ are optimized jointly, after which the sparse subnetworks are fine-tuned. Empirical results across six datasets and four architectures show that PASS yields higher accuracy at the same FLOPs and that both the sparse structure and the hypernetwork transfer well, with ablations confirming the necessity of visual prompts, recurrence, and the chosen pruning strategy. Overall, PASS demonstrates that data-centric prompts can effectively guide structural sparsification, enabling efficient deployment of large CNNs and suggesting a general pathway for prompt-informed model compression.
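
The masking rule $\widehat{W}^{(i)} = M^{(i-1)} \otimes W^{(i)} \otimes M^{(i)}$ can be illustrated with a minimal sketch. It assumes, as the indices suggest, that $M^{(i-1)}$ zeroes input channels and $M^{(i)}$ zeroes output channels of layer $i$. The function name `apply_channel_masks`, the nested-list weight layout, and the toy values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def apply_channel_masks(weight, mask_in, mask_out):
    """Return mask_in (x) weight (x) mask_out for one convolutional layer.

    weight[o][c] is the flattened kernel linking input channel c to output
    channel o. mask_in holds one 0/1 entry per input channel (assumed to be
    the previous layer's mask M^(i-1)); mask_out holds one 0/1 entry per
    output channel (this layer's mask M^(i)).
    """
    if len(weight) != len(mask_out):
        raise ValueError("mask_out must have one entry per output channel")
    pruned = []
    for o, row in enumerate(weight):
        if len(row) != len(mask_in):
            raise ValueError("mask_in must have one entry per input channel")
        pruned.append(
            [[mask_out[o] * mask_in[c] * w for w in kernel]
             for c, kernel in enumerate(row)]
        )
    return pruned


# Toy layer: 2 output channels, 3 input channels, 1x1 kernels.
weight = [[[1.0], [2.0], [3.0]],
          [[4.0], [5.0], [6.0]]]
mask_in = [1, 0, 1]   # previous layer pruned its channel 1
mask_out = [0, 1]     # this layer prunes its output channel 0

sparse = apply_channel_masks(weight, mask_in, mask_out)
print(sparse)  # [[[0.0], [0.0], [0.0]], [[4.0], [0.0], [6.0]]]
```

A kernel survives only when both its input and its output channel are kept, which is why the masks of adjacent layers are coupled and why a recurrent generator that sees the previous layer's decision is a natural fit.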

Abstract

Large-scale neural networks have demonstrated remarkable performance in different domains like vision and language processing, although at the cost of massive computation resources. As illustrated by the compression literature, structural model pruning is a prominent approach to improving model efficiency, thanks to its acceleration-friendly sparsity patterns. One of the key questions of structural pruning is how to estimate channel significance. In parallel, work on data-centric AI has shown that prompting-based techniques enable impressive generalization of large language models across diverse downstream tasks. In this paper, we investigate a charming possibility: \textit{leveraging visual prompts to capture the channel importance and derive high-quality structural sparsity}. To this end, we propose a novel algorithmic framework, namely \texttt{PASS}. It is a tailored hyper-network that takes both visual prompts and network weight statistics as input and outputs layer-wise channel sparsity in a recurrent manner. This design accounts for the intrinsic channel dependency between layers. Comprehensive experiments across multiple network architectures and six datasets demonstrate the superiority of \texttt{PASS} in locating good structural sparsity. For example, at the same FLOPs level, \texttt{PASS} subnetworks achieve $1\%\sim 3\%$ better accuracy on the Food101 dataset; or, at a similar performance of $80\%$ accuracy, \texttt{PASS} subnetworks obtain $0.35\times$ more speedup than the baselines.

Paper Structure

This paper contains 20 sections, 3 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: The overall framework of PASS. (Left) Our pruning target is a convolutional neural network (CNN) that takes images and visual prompts as input. (Right) The PASS hyper-network integrates the information from visual prompts and layer-wise weight statistics, then determines the significant structural topologies in a recurrent fashion.
  • Figure 2: Test accuracy of channel-pruned networks across multiple downstream tasks based on the pre-trained ResNet-18 model.
  • Figure 3: Test accuracy of channel-pruned networks across various architectures based on CIFAR-100 and Tiny-ImageNet datasets.
  • Figure 4: Ablation study on visual prompt strategies and their sizes. Experiments are conducted on CIFAR-100 and a pre-trained ResNet-18.
  • Figure 5: (1) Ablation study of the hypernetwork's hidden size (left) using a pre-trained ResNet-18 on CIFAR-100. (2) Comparison between global pruning and uniform pruning strategies (middle and right) using a pre-trained ResNet-18 on CIFAR-100 and Tiny-ImageNet.
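
The difference between the global and uniform pruning strategies compared in Figure 5 can be sketched as follows. This is a generic illustration of the two strategies under stated assumptions, not the paper's exact procedure: the per-channel scores are made-up numbers, the function names `uniform_masks` and `global_masks` are hypothetical, and the global variant includes no safeguard against removing every channel of a layer.

```python
def uniform_masks(scores, keep_ratio):
    """Keep the top keep_ratio fraction of channels separately in each layer."""
    masks = []
    for layer in scores:
        k = max(1, round(keep_ratio * len(layer)))
        order = sorted(range(len(layer)), key=lambda j: layer[j], reverse=True)
        kept = set(order[:k])
        masks.append([1 if j in kept else 0 for j in range(len(layer))])
    return masks


def global_masks(scores, keep_ratio):
    """Rank the channels of all layers together and keep the top fraction overall.

    Note: without an extra guard, a layer with uniformly low scores could
    lose all of its channels.
    """
    flat = [(s, i, j) for i, layer in enumerate(scores) for j, s in enumerate(layer)]
    k = max(1, round(keep_ratio * len(flat)))
    ranked = sorted(flat, key=lambda t: t[0], reverse=True)
    kept = {(i, j) for _, i, j in ranked[:k]}
    return [[1 if (i, j) in kept else 0 for j in range(len(layer))]
            for i, layer in enumerate(scores)]


# Hypothetical channel-importance scores for two layers of four channels each.
scores = [[0.9, 0.8, 0.7, 0.2],
          [0.6, 0.3, 0.1, 0.05]]

print(uniform_masks(scores, 0.5))  # [[1, 1, 0, 0], [1, 1, 0, 0]]
print(global_masks(scores, 0.5))   # [[1, 1, 1, 0], [1, 0, 0, 0]]
```

Both calls keep four of the eight channels, but uniform pruning fixes the same ratio in every layer, while global pruning lets the scores decide how the budget is distributed across layers.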