VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation
Hanning Chen, Yang Ni, Wenjun Huang, Yezi Liu, SungHeon Jeong, Fei Wen, Nathaniel Bastian, Hugo Latapie, Mohsen Imani
TL;DR
The paper addresses the high computational cost of Vision Transformer-based segmentation in Task-Oriented Segmentation (TOS) by introducing VLTP, a token pruning framework guided by vision-language reasoning from an MLLM. VLTP integrates a lightweight prune decoder at multiple ViT layers to score the relevance of each image token with respect to a SEG guidance token produced by the MLLM. Low-relevance tokens are pruned and reactivated later to preserve accuracy. The approach reduces GFLOPs by about 25% without performance loss and by up to 40% with only around a 1% mIoU drop, while achieving state-of-the-art mIoU on the RIO and COCO-Tasks datasets. The work shows that combining vision-language reasoning with targeted token pruning can accelerate ViT-based segmentation in complex, task-driven scenarios.
Abstract
Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of the most effective strategies for reducing this cost, but previous approaches fall short when applied to the more complex task-oriented segmentation (TOS), where the class of each image patch is not predefined but depends on the specific input task. This work introduces Vision-Language Guided Token Pruning (VLTP), a novel token pruning mechanism that accelerates ViT-based segmentation models, particularly for TOS guided by a multi-modal large language model (MLLM). We argue that a ViT does not need to process every image token through all of its layers: only the tokens related to the reasoning task are necessary. We design a new pruning decoder that takes both image tokens and vision-language guidance as input and predicts the relevance of each image token to the task. Only image tokens with high relevance are passed to the deeper layers of the ViT. Experiments show that the VLTP framework reduces the computational cost of the ViT by approximately 25% without performance degradation and by around 40% with only a 1% performance drop. The code associated with this study can be found at this URL.
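The prune-then-reactivate idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the learned prune decoder is replaced here by plain cosine similarity between each image token and a guidance vector, and the names `relevance_scores`, `prune`, `reactivate`, and `keep_ratio` are hypothetical. Reactivation is assumed to mean restoring pruned tokens, with their pre-pruning features, to their original positions.

```python
import math


def relevance_scores(tokens, guidance):
    """Cosine similarity of each token to the guidance vector.

    Assumption: this stands in for the learned prune decoder,
    which the paper trains to predict task relevance.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v)) or 1.0  # avoid divide-by-zero

    g = norm(guidance)
    return [
        sum(a * b for a, b in zip(t, guidance)) / (norm(t) * g)
        for t in tokens
    ]


def prune(tokens, guidance, keep_ratio):
    """Return (keep, dropped) index lists; keep holds the top-scoring tokens."""
    scores = relevance_scores(tokens, guidance)
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve spatial order of kept tokens
    kept = set(keep)
    dropped = [i for i in range(len(tokens)) if i not in kept]
    return keep, dropped


def reactivate(processed_kept, keep, original_tokens, dropped):
    """Rebuild the full token sequence after the deeper layers have run.

    Kept positions receive their processed features; dropped positions
    receive the features they had when they were pruned (an assumption).
    """
    out = [None] * len(original_tokens)
    for idx, tok in zip(keep, processed_kept):
        out[idx] = tok
    for idx in dropped:
        out[idx] = original_tokens[idx]
    return out


# Toy example: four 2-D tokens, guidance pointing along the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
guidance = [1.0, 0.0]

keep, dropped = prune(tokens, guidance, keep_ratio=0.5)

# Stand-in for the deeper ViT layers: only kept tokens are processed.
processed = [[2 * x for x in tokens[i]] for i in keep]

restored = reactivate(processed, keep, tokens, dropped)
```

In this toy setup only half of the tokens pass through the stand-in for the deeper layers, which is where the compute saving would come from, while the restored sequence keeps its full length so that a downstream mask decoder still sees every image position.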
