ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

Hyunchan Moon, Cheonjun Park, Steven L. Waslander

TL;DR

This work proposes ToaSt, a decoupled framework that applies specialized compression strategies to distinct ViT components: coupled head-wise structured pruning for Multi-Head Self-Attention modules, which leverages attention operation characteristics to enhance robustness, and Token Channel Selection (TCS) for Feed-Forward Networks.

Abstract

Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64%) with 39.4% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.

Paper Structure (20 sections, 6 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: ToaSt compression methodology. (a) Standard ViT block architecture. (b) Token compression propagates compression effects across layers due to inter-layer dependencies. (c) ToaSt independently compresses each layer through coupled weight pruning (MHSA) and token channel selection (FFN), preventing cross-layer propagation while reducing $d_k$ and $D$ dimensions.
  • Figure 2: Overview of the ToaSt framework for layer-independent compression. (a) Structured Coupled MHSA Weight Pruning: Pruning indices are synchronized across coupled groups (Q-K and V-Proj) to reduce the internal head dimension $d_k$ while preserving the attention mechanism's functional integrity. (b) Token Channel Selection for FFN: Redundant channels in the intermediate FFN layer are identified and eliminated based on feature importance analysis, maintaining the global embedding dimension $D$ at the block interface.
  • Figure 3: Impact of coupled index synchronization on accuracy. Compared to non-aligned pruning, our synchronized Q-K and V-Proj pruning significantly mitigates accuracy drop, especially at high pruning ratios. This empirical evidence justifies the necessity of our structural constraints.
  • Figure 4: Layer-wise redundancy analysis of Swin-Base FFN. (Left) Sparsity increases in deeper stages, indicating many "dead neurons." (Center) Linear Reconstruction $R^2$ remains near 1.0, proving that feature channels are highly dependent. (Right) Effective Rank collapses in later stages, confirming that the $4D$ expansion contains massive redundancy.
  • Figure 5: Layer-wise FFN TCS Sensitivity Analysis. Sensitivity analysis of FC1, FC2, and combined pruning across DeiT-Small layers at various ratios (10%-90%). (a) FC1 shows high sensitivity in early layers but robustness in later layers (L9-11), with L11 improving accuracy up to 80% pruning. (b) FC2 exhibits lower sensitivity, enabling aggressive pruning (50-90%) in later layers. (c) Combined pruning validates asymmetric layer-adaptive ratios exploiting distinct redundancy patterns between FC1 and FC2.
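
The index coupling described in the Figure 2 caption can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names, the weight-norm and mean-activation importance scores, the ReLU activation, and the toy dimensions are all assumptions made for the example. It shows two structural points from the caption: Q and K share one set of kept indices while V and the output projection share another, and removing intermediate FFN channels leaves the block's embedding dimension $D$ unchanged.

```python
import numpy as np

def top_k_indices(scores, k):
    """Return the sorted indices of the k largest scores."""
    return np.sort(np.argsort(scores)[-k:])

def prune_head_coupled(Wq, Wk, Wv, Wo, keep_qk, keep_v):
    """Prune one attention head with synchronized indices.

    Wq, Wk, Wv have shape (D, d_k); Wo has shape (d_k, D).
    Q and K share keep_qk; V and the output projection share keep_v.
    """
    return Wq[:, keep_qk], Wk[:, keep_qk], Wv[:, keep_v], Wo[keep_v, :]

def prune_ffn(W1, b1, W2, keep):
    """Drop intermediate FFN channels; input and output stay D-dimensional."""
    return W1[:, keep], b1[keep], W2[keep, :]

def ffn(X, W1, b1, W2):
    """Two-layer FFN with ReLU (assumed here for simplicity)."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2

rng = np.random.default_rng(0)
N, D, d_k, H = 5, 16, 8, 64  # tokens, embedding dim, head dim, FFN hidden dim
X = rng.standard_normal((N, D))

# --- (a) Coupled pruning of a single attention head ---
Wq, Wk, Wv = (rng.standard_normal((D, d_k)) for _ in range(3))
Wo = rng.standard_normal((d_k, D))

# Hypothetical importance score: product of the paired weight norms.
qk_score = np.linalg.norm(Wq, axis=0) * np.linalg.norm(Wk, axis=0)
v_score = np.linalg.norm(Wv, axis=0) * np.linalg.norm(Wo, axis=1)
keep_qk = top_k_indices(qk_score, d_k // 2)
keep_v = top_k_indices(v_score, d_k // 2)

Wq_p, Wk_p, Wv_p, Wo_p = prune_head_coupled(Wq, Wk, Wv, Wo, keep_qk, keep_v)
logits_p = (X @ Wq_p) @ (X @ Wk_p).T   # (N, N) attention logits, unscaled
value_path_p = (X @ Wv_p) @ Wo_p       # (N, D) value/projection path

# --- (b) Channel selection in the FFN intermediate layer ---
W1 = rng.standard_normal((D, H))
b1 = rng.standard_normal(H)
W2 = rng.standard_normal((H, D))

# Hypothetical importance score: mean absolute activation on sample tokens.
hidden = np.maximum(X @ W1 + b1, 0.0)
keep_ffn = top_k_indices(np.abs(hidden).mean(axis=0), H // 2)

W1_p, b1_p, W2_p = prune_ffn(W1, b1, W2, keep_ffn)
out_p = ffn(X, W1_p, b1_p, W2_p)       # still (N, D)
```

Because the same indices are removed from both sides of each product, the pruned logits equal the original dot products restricted to the kept dimensions, and the pruned FFN equals the full FFN with the dropped hidden channels zeroed. Pruning Q and K with different index sets would instead pair unrelated dimensions, which is the non-aligned case compared in Figure 3.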