Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning

Yuanbing Ouyang, Yizhuo Liang, Qingpeng Li, Xinfei Guo, Yiming Luo, Di Wu, Hao Wang, Yushan Pan

TL;DR

This work tackles the computational burden of Vision Transformers (ViTs) in semantic segmentation by proposing LVTP, a progressive token pruning framework guided by multi-scale Tsallis entropy and augmented with low-level Sobel edge features. It introduces a dynamic, entropy-based scoring mechanism and a two-stage clustering procedure that preserves semantically important tokens while retaining edge information, all without retraining. Across cross-domain datasets and backbones, LVTP reduces GFLOPs by roughly 20–46% with minimal mIoU degradation and a favorable efficiency trade-off (high $b$), outperforming existing pruning methods. The approach generalizes well across ViT backbones and segmentors and is practical for edge devices, with potential extensions to broader visual tasks and hardware-aware optimization.

Abstract

Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource-constrained devices. Existing token pruning methods often overlook fundamental visual data characteristics. This study introduces 'LVTP', a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features, with two-stage clustering. It integrates high-level semantics and basic visual attributes for precise segmentation. A novel dynamic scoring mechanism using multi-scale Tsallis entropy weighting overcomes the limitations of traditional single-parameter entropy. The framework also incorporates low-level feature analysis to preserve critical edge information while reducing computational cost. As a plug-and-play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show 20%–45% computational reductions with negligible performance loss, outperforming existing methods in balancing cost and accuracy, especially in complex edge regions.
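
To make the entropy-based scoring idea concrete, here is a minimal sketch of the standard Tsallis entropy, S_q(p) = (1 - Σ p_i^q) / (q - 1), combined over several values of q. It is not the authors' implementation. The function names, the choice of q values, and the uniform weighting are all assumptions made for illustration. How the paper sets its weights, and how scores map to pruning decisions, is not specified in this summary.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i**q) / (q - 1) along the last axis."""
    p = np.asarray(p, dtype=np.float64)
    if np.isclose(q, 1.0):
        # The limit q -> 1 recovers Shannon entropy; guard log(0) for zero entries.
        safe = np.where(p > 0, p, 1.0)
        return -(p * np.log(safe)).sum(axis=-1)
    return (1.0 - (p ** q).sum(axis=-1)) / (q - 1.0)

def multiscale_tsallis_score(attn, qs=(0.5, 1.0, 2.0), weights=None):
    """Weighted average of Tsallis entropies over several q values.

    `attn` holds one probability distribution per row (for example, a token's
    normalized attention weights). The q values and the uniform default
    weights are hypothetical, not taken from the paper.
    """
    attn = np.asarray(attn, dtype=np.float64)
    if weights is None:
        weights = np.full(len(qs), 1.0 / len(qs))
    weights = np.asarray(weights, dtype=np.float64)
    # Shape (len(qs), num_tokens): one entropy value per q and per token.
    per_q = np.stack([tsallis_entropy(attn, q) for q in qs], axis=0)
    # Contract the q axis with the weights to get one score per token.
    return np.tensordot(weights, per_q, axes=1)
```

Under this sketch, a token whose distribution is concentrated on a single entry scores 0, and a token with a uniform distribution scores highest. Using several q values changes how strongly rare versus dominant entries contribute to the score.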

Paper Structure

This paper contains 24 sections, 11 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Framework for Edge-Enhanced Token Clustering in Transformers
  • Figure 2: Edge Enhancement Example
  • Figure 3: Pruned SAM-ViT-H's predictions on the RIO and COCO datasets
  • Figure 4: Pruned Swin-Unet-L's predictions on the Massachusetts-Road dataset
  • Figure 5: Framework for Edge-Enhanced Token Clustering in Transformers