Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning
Yuanbing Ouyang, Yizhuo Liang, Qingpeng Li, Xinfei Guo, Yiming Luo, Di Wu, Hao Wang, Yushan Pan
TL;DR
This work tackles the computational cost of Vision Transformers (ViTs) in semantic segmentation by proposing LVTP, a progressive token pruning framework guided by multi-scale Tsallis entropy and augmented with low-level Sobel edge features. It introduces a dynamic, entropy-based scoring mechanism and a two-stage clustering step that together preserve semantically important tokens while retaining edge information, all without retraining. Across cross-domain datasets and backbones, LVTP reduces GFLOPs by roughly 20–46% with minimal mIoU degradation and a favorable efficiency trade-off (high $b$), outperforming existing pruning methods. The approach generalizes well across ViT backbones and segmentors and is practical for edge devices, with potential extensions to broader visual tasks and hardware-aware optimization.
Abstract
Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource-constrained devices. Existing token pruning methods often overlook fundamental visual data characteristics. This study introduces LVTP, a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features, with two-stage clustering. It integrates high-level semantics and basic visual attributes for precise segmentation. A novel dynamic scoring mechanism using multi-scale Tsallis entropy weighting overcomes the limitations of traditional single-parameter entropy. The framework also incorporates low-level feature analysis to preserve critical edge information while reducing computational cost. As a plug-and-play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show 20%–45% computational reductions with negligible performance loss, outperforming existing methods in balancing cost and accuracy, especially in complex edge regions.
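To make the two ingredients named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It uses the standard Tsallis entropy $S_q(p) = (1 - \sum_i p_i^q)/(q - 1)$, which reduces to Shannon entropy as $q \to 1$, and the standard 3x3 Sobel kernels. Everything else is an assumption made for illustration: the function names, the choice of $q$ values, the uniform weighting across scales, the linear mix of entropy and edge scores, and the rule that higher-scoring tokens are kept. The two-stage clustering is not sketched because the abstract does not describe it.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy along the last axis; Shannon entropy in the limit q -> 1."""
    p = np.clip(p, 1e-12, 1.0)
    if abs(q - 1.0) < 1e-8:
        return -np.sum(p * np.log(p), axis=-1)
    return (1.0 - np.sum(p ** q, axis=-1)) / (q - 1.0)

def multiscale_entropy_score(probs, qs=(0.5, 1.0, 2.0), weights=None):
    """Weighted mix of Tsallis entropies at several q (hypothetical weighting).

    probs: (N, K) per-token class probabilities. Each scale is divided by its
    value on the uniform distribution so that all scales lie in [0, 1].
    """
    k = probs.shape[-1]
    uniform = np.full(k, 1.0 / k)
    per_scale = np.stack(
        [tsallis_entropy(probs, q) / tsallis_entropy(uniform, q) for q in qs]
    )  # shape (len(qs), N)
    if weights is None:
        weights = np.full(len(qs), 1.0 / len(qs))  # assumption: uniform weights
    return np.asarray(weights) @ per_scale

def sobel_edge_magnitude(img):
    """Gradient magnitude of a 2-D grayscale image using 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    padded = np.pad(img.astype(float), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)

def patch_mean(arr, patch):
    """Average a (H, W) map over non-overlapping patch x patch blocks, flattened."""
    h, w = arr.shape
    blocks = arr.reshape(h // patch, patch, w // patch, patch)
    return blocks.mean(axis=(1, 3)).reshape(-1)

def select_tokens(probs, img, patch, keep_ratio=0.6, edge_weight=0.5):
    """Return sorted indices of tokens to keep (illustrative rule, not LVTP's)."""
    entropy = multiscale_entropy_score(probs)
    edge = patch_mean(sobel_edge_magnitude(img), patch)
    edge = edge / (edge.max() + 1e-12)  # scale edge scores to [0, 1]
    score = (1.0 - edge_weight) * entropy + edge_weight * edge
    n_keep = max(1, int(round(keep_ratio * score.size)))
    return np.sort(np.argsort(-score)[:n_keep])
```

Under these assumptions, a token survives if its predicted class distribution is uncertain or if its patch contains strong edges, which mirrors the abstract's stated goal of preserving edge information while pruning. A progressive scheme would repeat a selection like this at several layers with different keep ratios.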
