Stratified Knowledge-Density Super-Network for Scalable Vision Transformers

Longhua Li, Lei Qi, Xin Geng

TL;DR

This paper addresses the cost of deploying vision transformers across a wide range of resource constraints by turning a pre-trained ViT into a Stratified Knowledge-Density (SKD) Super-Network. It introduces Weighted PCA for Attention Contraction (WPAC), which concentrates knowledge into a compact set of dimensions while preserving the original network function, and Progressive Importance-Aware Dropout (PIAD), which promotes stratified knowledge organization through importance-guided sub-network training. Together, the two techniques allow sub-networks of arbitrary size to be extracted from the super-network at no additional compression cost while retaining as much knowledge as possible. The resulting sub-networks compare well against both pruning-based and expansion-based baselines and transfer robustly to downstream tasks, giving a single scalable route to deploying ViTs across diverse hardware regimes with reduced fine-tuning and maintenance costs.

Abstract

Training and deploying multiple vision transformer (ViT) models for different resource constraints is costly and inefficient. To address this, we propose transforming a pre-trained ViT into a stratified knowledge-density super-network, where knowledge is hierarchically organized across weights. This enables flexible extraction of sub-networks that retain maximal knowledge for varying model sizes. We introduce Weighted PCA for Attention Contraction (WPAC), which concentrates knowledge into a compact set of critical weights. WPAC applies token-wise weighted principal component analysis to intermediate features and injects the resulting transformation and inverse matrices into adjacent layers, preserving the original network function while enhancing knowledge compactness. To further promote stratified knowledge organization, we propose Progressive Importance-Aware Dropout (PIAD). PIAD progressively evaluates the importance of weight groups, updates an importance-aware dropout list, and trains the super-network under this dropout regime to promote knowledge stratification. Experiments demonstrate that WPAC outperforms existing pruning criteria in knowledge concentration, and the combination with PIAD offers a strong alternative to state-of-the-art model compression and model expansion methods.
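
The function-preserving idea behind WPAC can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a toy pair of adjacent linear maps with no nonlinearity between them (loosely analogous to a value/output projection pair), an uncentered weighted second-moment matrix, and an orthogonal basis so that the inverse is simply the transpose. The names `weighted_pca_basis`, `inject_rotation`, and the random token weights are hypothetical.

```python
import numpy as np

def weighted_pca_basis(features, token_weights):
    """Return an orthogonal (d, d) basis whose columns are sorted by
    descending token-weighted variance of `features` (n_tokens, d)."""
    w = token_weights / token_weights.sum()
    # Uncentered weighted second-moment matrix (an assumption of this sketch),
    # which keeps the change of basis purely linear.
    moment = (features * w[:, None]).T @ features
    eigvals, eigvecs = np.linalg.eigh(moment)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]

def inject_rotation(w1, b1, w2, basis):
    """Fold the basis into the first layer and its inverse (the transpose,
    since the basis is orthogonal) into the second layer."""
    return w1 @ basis, b1 @ basis, basis.T @ w2

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))            # 64 tokens, 16 input features
w1 = rng.normal(size=(16, 8))
b1 = rng.normal(size=8)
w2 = rng.normal(size=(8, 16))

hidden = x @ w1 + b1                     # intermediate features
weights = rng.uniform(0.1, 1.0, size=64) # hypothetical token importance

basis = weighted_pca_basis(hidden, weights)
w1r, b1r, w2r = inject_rotation(w1, b1, w2, basis)

full_before = hidden @ w2
full_after = (x @ w1r + b1r) @ w2r       # same function, rotated hidden space

# A smaller sub-network keeps only the first k hidden dimensions.
k = 4
small_out = (x @ w1r[:, :k] + b1r[:k]) @ w2r[:k, :]
```

With all hidden dimensions kept, `full_after` matches `full_before` up to floating-point error. After the rotation, the leading dimensions carry most of the weighted feature energy, so truncating to the first `k` columns discards the least informative directions under this sketch's weighting.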

Paper Structure

This paper contains 30 sections, 7 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: An overview of the proposed method. Traditional methods require separate compression for each deployment setting. In contrast, our method builds a Stratified Knowledge-Density (SKD) Super-Network in a single pass, enabling cost-free sub-network extraction for arbitrary model sizes.
  • Figure 2: An overview of the proposed WPAC. (a) Principal component projection matrices are computed using importance-weighted intermediate features. (b) The resulting transformation matrices are applied to the pre-trained weights. (c) The transformed linear layers can preserve the original neural function using only a small number of principal dimensions.
  • Figure 3: An overview of the proposed PIAD. Step 1: At the start of each epoch, evaluate the importance of parameter groups not yet in the dropout list and append the least important ones. Step 2: During training, sample sub-networks by randomly dropping the least important units from the dropout list and train them, propagating gradients back to the SKD Super-Network.
  • Figure 4: Comparison of weighting strategies with different granularities across varying numbers of retained dimensions.
  • Figure 5: Impact of proxy set size on WPAC performance. Each size is evaluated using 5 random seeds, and the results reflect performance with head dimensions reduced by half.
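
The two PIAD steps in the Figure 3 caption can be sketched as simple bookkeeping. This is a hedged illustration only: the importance scores are a fixed stand-in dictionary rather than a real per-epoch evaluation, the function names are hypothetical, and the sampling rule (drop a random-length prefix of the list, least important first) is one plausible reading of the caption, not a confirmed detail of the method. No actual training is performed.

```python
import random

def update_dropout_list(dropout_list, importance, n_append):
    """Step 1: append the n_append least important groups that are
    not yet in the dropout list. `importance` maps group id -> score."""
    listed = set(dropout_list)
    candidates = [g for g in importance if g not in listed]
    candidates.sort(key=lambda g: importance[g])
    return dropout_list + candidates[:n_append]

def sample_keep_mask(dropout_list, n_groups, rng):
    """Step 2: sample a sub-network by dropping a random-length prefix of
    the dropout list (assumed ordering: least important groups first)."""
    n_drop = rng.randint(0, len(dropout_list))  # inclusive on both ends
    dropped = set(dropout_list[:n_drop])
    return [g not in dropped for g in range(n_groups)]

# Stand-in importance scores for four parameter groups (hypothetical values).
importance = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.3}

dropout_list = []
dropout_list = update_dropout_list(dropout_list, importance, n_append=2)  # epoch 1
dropout_list = update_dropout_list(dropout_list, importance, n_append=1)  # epoch 2

rng = random.Random(0)
mask = sample_keep_mask(dropout_list, n_groups=4, rng=rng)
```

Under this reading, group 0 never enters the list within these two epochs and is therefore kept in every sampled sub-network, while groups earlier in the list are dropped more often. This gives the most important groups the most consistent training signal.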