Entropy Reveals Block Importance in Masked Self-Supervised Vision Transformers

Peihao Xiang, Kaida Wu, Ou Bai

TL;DR

The paper tackles the challenge of deploying large masked self-supervised vision transformers by questioning whether all transformer blocks are equally important for downstream transfer. It introduces weight-number entropy, an information-theoretic measure computed from pretrained weights, as a data-free surrogate for block importance, and presents Gardener, a one-shot pruning algorithm that removes low-entropy blocks without data or iterative finetuning. Through experiments on VideoMAE-B pretrained on Kinetics-400 and finetuned on UCF101, the approach achieves near-oracle performance, reveals substantial block-level redundancy, and demonstrates practical gains in transfer efficiency with minimal overhead. Overall, the work provides a principled, data-free pathway to compress and adapt large self-supervised vision transformers for resource-constrained deployment and rapid transfer learning.
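
The data-free, one-shot selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact definition of weight-number entropy is not given in this summary, so the sketch assumes a Shannon entropy over a fixed-range histogram of each block's flattened weights. The names `histogram_entropy` and `select_blocks_to_keep`, the `num_bins` parameter, and the shared bin range are all hypothetical.

```python
import math
from collections import Counter


def histogram_entropy(weights, lo, hi, num_bins=256):
    """Shannon entropy (bits) of a fixed-range histogram of `weights`.

    Assumption: this histogram entropy stands in for the paper's
    weight-number entropy, whose exact definition is not shown here.
    """
    if hi <= lo:
        return 0.0  # all values identical: a single occupied bin
    width = (hi - lo) / num_bins
    counts = Counter()
    for w in weights:
        # Clamp so values on the range edges fall into the end bins.
        idx = min(max(int((w - lo) / width), 0), num_bins - 1)
        counts[idx] += 1
    total = len(weights)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_blocks_to_keep(block_weights, num_keep, num_bins=256):
    """Score every block once, then keep the `num_keep` highest-entropy blocks.

    `block_weights` is a list with one flat list of floats per block.
    Returns (sorted indices of kept blocks, per-block entropy scores).
    """
    # One shared range for all blocks, so the scores are comparable.
    lo = min(min(w) for w in block_weights)
    hi = max(max(w) for w in block_weights)
    scores = [histogram_entropy(w, lo, hi, num_bins) for w in block_weights]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep]), scores
```

No data and no forward passes are involved: the only input is the pretrained weights, and the ranking is computed once. The shared bin range is a design choice of this sketch; binning each block over its own min-max range would make the score insensitive to how widely a block's weights are spread.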

Abstract

Masked self-supervised vision transformers have become a dominant pretraining paradigm, yet their substantial model size poses significant challenges for resource-constrained deployment and efficient transfer learning. A fundamental question remains: are all transformer blocks equally important for downstream performance? In this paper, we show that block importance in masked self-supervised vision transformers can be accurately estimated without access to any data. Our key finding is that the information entropy of pretrained block weights strongly correlates with oracle sensitivity obtained via iterative block removal and finetuning. This observation enables Gardener, a data-free, one-shot, block-level pruning principle that identifies redundant blocks through simple information-theoretic measurements. We evaluate Gardener on VideoMAE-B across multiple pruning ratios and downstream video recognition benchmarks. Despite its negligible computational overhead, Gardener consistently matches or outperforms existing data-free pruning baselines and closely approaches sensitivity-based pruning. Remarkably, even after pruning up to 91.7% of blocks, the pruned model retains competitive transfer performance. Our results reveal substantial block-level redundancy in masked self-supervised vision transformers and demonstrate that information-theoretic analysis offers a principled and efficient pathway for model compression and resource-efficient transfer learning.
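
To make the pruning ratios concrete, the short sketch below lists the block-level ratios available in a 12-block encoder. The block count is taken from the Figure 1 caption (block indices 1–12 for VideoMAE-B); the helper name `pruning_ratio` is hypothetical. With 12 blocks, the 91.7% figure corresponds to removing 11 blocks and keeping one.

```python
NUM_BLOCKS = 12  # from the Figure 1 caption: block indices 1-12


def pruning_ratio(num_removed, num_blocks=NUM_BLOCKS):
    """Fraction of transformer blocks removed from the encoder."""
    if not 0 <= num_removed <= num_blocks:
        raise ValueError("num_removed must lie in [0, num_blocks]")
    return num_removed / num_blocks


# Percentage of blocks pruned for every possible number of removed blocks.
ratios = {k: round(100 * pruning_ratio(k), 1) for k in range(NUM_BLOCKS + 1)}

for removed, percent in ratios.items():
    print(f"remove {removed:2d} of {NUM_BLOCKS} blocks -> {percent}% pruned")
```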

Paper Structure (20 sections, 3 equations, 3 figures, 12 tables, 1 algorithm)

Figures (3)

  • Figure 1: Pruning behavior of a masked self-supervised vision transformer (VideoMAE-B) finetuned on UCF101. (a) Block-wise pruning results, where index 0 denotes the original model and indices 1–12 correspond to removing individual transformer blocks. The dashed line indicates the unpruned baseline. The results reveal substantial block-level heterogeneity. (b) Multi-block pruning performance under increasing pruning ratios. Entropy-based pruning (Gardener) consistently tracks sensitivity-based (oracle) pruning and outperforms other data-free criteria across a wide range of pruning ratios.
  • Figure 2: Architecture of the information-entropy-based, block-level Gardener pruning algorithm. Step 1: visual self-supervised learning; Step 2: computing the block-level pruning criteria; Step 3: pruning the pretrained VideoMAE encoder; Step 4: finetuning the pruned VideoMAE encoder.
  • Figure 3: Weight distributions of transformer blocks at different depths in a pretrained VideoMAE encoder. Early blocks exhibit sharply peaked distributions, while deeper blocks show increasingly flat and dispersed parameter distributions.
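
The link between the shape of a weight distribution (Figure 3) and its entropy can be illustrated with a closed-form case. The sketch below assumes, purely for illustration, that a block's weights are Gaussian, and uses the standard differential entropy of a Gaussian, 0.5 · log2(2πe·σ²) bits. The standard deviations 0.02 and 0.2 are made-up values rather than measurements from VideoMAE, and the paper's own entropy measure may be defined differently.

```python
import math


def gaussian_entropy_bits(sigma):
    """Differential entropy (bits) of a Gaussian with standard deviation `sigma`."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)


# Hypothetical spreads: a sharply peaked block vs. a more dispersed one.
peaked = gaussian_entropy_bits(0.02)
dispersed = gaussian_entropy_bits(0.2)

# The gap depends only on the ratio of the spreads: log2(0.2 / 0.02) = log2(10).
gap = dispersed - peaked
print(f"peaked: {peaked:.3f} bits, dispersed: {dispersed:.3f} bits, gap: {gap:.3f} bits")
```

Under this assumption, a flatter, more dispersed distribution always scores higher than a sharply peaked one, and widening the spread tenfold adds log2(10) bits regardless of the starting value.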