Entropy Reveals Block Importance in Masked Self-Supervised Vision Transformers
Peihao Xiang, Kaida Wu, Ou Bai
TL;DR
The paper addresses the difficulty of deploying large masked self-supervised vision transformers by asking whether all transformer blocks are equally important for downstream transfer. It introduces weight-number entropy, an information-theoretic measure computed from pretrained weights, as a data-free surrogate for block importance, and presents Gardener, a one-shot pruning algorithm that removes low-entropy blocks without data or iterative finetuning. In experiments with VideoMAE-B pretrained on Kinetics-400 and finetuned on UCF101, the approach achieves near-oracle performance, reveals substantial block-level redundancy, and improves transfer efficiency with minimal overhead. Overall, the work offers a principled, data-free way to compress and adapt large self-supervised vision transformers for resource-constrained deployment and rapid transfer learning.
Abstract
Masked self-supervised vision transformers have become a dominant pretraining paradigm, yet their substantial model size poses significant challenges for resource-constrained deployment and efficient transfer learning. A fundamental question remains: are all transformer blocks equally important for downstream performance? In this paper, we show that block importance in masked self-supervised vision transformers can be accurately estimated without access to any data. Our key finding is that the information entropy of pretrained block weights strongly correlates with oracle sensitivity obtained via iterative block removal and finetuning. This observation enables Gardener, a data-free, one-shot, block-level pruning principle that identifies redundant blocks through simple information-theoretic measurements. We evaluate Gardener on VideoMAE-B across multiple pruning ratios and downstream video recognition benchmarks. Despite its negligible computational overhead, Gardener consistently matches or outperforms existing data-free pruning baselines and closely approaches sensitivity-based pruning. Remarkably, even after pruning up to 91.7% of blocks, the pruned model retains competitive transfer performance. Our results reveal substantial block-level redundancy in masked self-supervised vision transformers and demonstrate that information-theoretic analysis offers a principled and efficient pathway for model compression and resource-efficient transfer learning.
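The data-free ranking idea described above can be illustrated with a minimal sketch. The abstract does not specify how the entropy is computed, so everything concrete here is an assumption: the histogram-based Shannon entropy estimate, the `num_bins` value, and the helper names `weight_entropy` and `select_blocks_to_keep` are hypothetical, and the toy arrays stand in for real pretrained block weights. It is not the authors' implementation of Gardener.

```python
import numpy as np


def weight_entropy(weights, num_bins=256):
    """Shannon entropy (in bits) of a histogram over all weight values of one block.

    ASSUMPTION: a fixed-bin histogram is used as the entropy estimator;
    the source does not state the estimator or the bin count.
    """
    # Flatten every parameter tensor of the block into a single 1-D vector.
    flat = np.concatenate(
        [np.asarray(w, dtype=np.float64).ravel() for w in weights]
    )
    counts, _ = np.histogram(flat, bins=num_bins)
    # Drop empty bins so that log2 is never evaluated at zero.
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log2(probs)).sum())


def select_blocks_to_keep(block_weights, num_keep, num_bins=256):
    """One-shot, data-free selection: keep the num_keep highest-entropy blocks.

    Returns the kept block indices in their original order, plus all entropies.
    """
    entropies = [weight_entropy(w, num_bins) for w in block_weights]
    order = np.argsort(entropies)[::-1]  # highest entropy first
    keep = sorted(int(i) for i in order[:num_keep])
    return keep, entropies


# Toy stand-ins for three "blocks" (hypothetical data, not real model weights).
rng = np.random.default_rng(0)
toy_blocks = [
    [np.full((4, 4), 0.1)],                 # constant weights: zero entropy
    [rng.uniform(-1.0, 1.0, size=(64, 64))],  # spread-out weights: high entropy
    [np.array([0.0, 1.0] * 8)],             # two equally likely values: 1 bit
]

keep, entropies = select_blocks_to_keep(toy_blocks, num_keep=2)
print("kept blocks:", keep)
print("entropies (bits):", [round(e, 3) for e in entropies])
```

With a real model, each entry of `block_weights` would hold the parameter tensors of one transformer block, and the blocks outside `keep` would be removed in a single pass before finetuning. Because each block is histogrammed over its own value range, the estimate depends on the binning choice, which is one reason this should be read as an illustration only.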
