VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Hanning Chen; Yang Ni; Wenjun Huang; Yezi Liu; SungHeon Jeong; Fei Wen; Nathaniel Bastian; Hugo Latapie; Mohsen Imani

TL;DR

The paper addresses the high computational cost of Vision Transformer (ViT) based models in Task-Oriented Segmentation (TOS) by introducing VLTP, a token pruning framework guided by vision-language reasoning from a multi-modal large language model (MLLM). VLTP integrates a lightweight prune decoder at multiple ViT layers. The decoder scores each image token's relevance to a SEG guidance token produced by the MLLM; low-relevance tokens are pruned and later reactivated to preserve accuracy. The approach reduces GFLOPs by about 25% without performance loss and by up to 40% with only around a 1% mIoU drop, while reporting state-of-the-art mIoU on the RIO and COCO-Tasks datasets. The work shows that combining vision-language reasoning with targeted token pruning can accelerate ViT-based segmentation in complex, task-driven scenarios and make real-world deployment more efficient.

Abstract

Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of the most effective strategies to address this complexity, but previous approaches fall short when applied to more complex task-oriented segmentation (TOS), where the class of each image patch is not predefined but depends on the specific input task. This work introduces Vision-Language Guided Token Pruning (VLTP), a novel token pruning mechanism that can accelerate ViT-based segmentation models, particularly for TOS guided by a multi-modal large language model (MLLM). We argue that a ViT does not need to process every image token through all of its layers: only the tokens related to the reasoning task are necessary. We design a new prune decoder that takes both image tokens and vision-language guidance as input to predict the relevance of each image token to the task. Only image tokens with high relevance are passed to deeper layers of the ViT. Experiments show that the VLTP framework reduces the computational cost of the ViT by approximately 25% without performance degradation and by around 40% with only a 1% performance drop. The code associated with this study can be found at this URL.
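
The prune-then-reactivate idea described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: cosine similarity to a guidance vector stands in for the learned prune decoder, and the function names (`prune_tokens`, `reactivate`), the 2-D toy tokens, and the "carry the last state forward" reactivation rule are all hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_tokens(tokens, guidance, prune_rate):
    """Split token indices into (kept, pruned) by relevance to `guidance`.

    ASSUMPTION: cosine similarity replaces the learned prune decoder.
    Both index lists are returned sorted by original token position.
    """
    scores = [cosine(t, guidance) for t in tokens]
    n_keep = max(1, round(len(tokens) * (1 - prune_rate)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), sorted(ranked[n_keep:])


def reactivate(processed, kept, stale, pruned, n_tokens):
    """Rebuild the full token sequence in its original order.

    ASSUMPTION: pruned tokens re-enter with the state they had when pruned.
    """
    full = [None] * n_tokens
    for idx, tok in zip(kept, processed):
        full[idx] = tok
    for idx, tok in zip(pruned, stale):
        full[idx] = tok
    return full


# Toy input: 9 patch tokens with 2-D features and a guidance vector.
tokens = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2], [0.1, 1.0], [-1.0, 0.0],
          [0.2, 0.9], [1.0, 0.0], [0.0, -1.0], [-0.5, 0.5]]
guidance = [1.0, 0.0]

kept, pruned = prune_tokens(tokens, guidance, prune_rate=0.7)

# Stand-in for the deeper ViT layers: only the kept tokens are processed.
processed = [[2 * x for x in tokens[i]] for i in kept]
stale = [tokens[i] for i in pruned]

full = reactivate(processed, kept, stale, pruned, len(tokens))
print(kept, pruned)
```

With a pruning rate of 0.7, only three of the nine toy tokens reach the "deeper layers" stand-in, and the reactivation step restores a full-length sequence so that a downstream mask decoder would still see every patch position.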

Paper Structure (18 sections, 13 equations, 6 figures, 8 tables)

Introduction
Related Works
Image Segmentation
Vision Transformer for Segmentation
Token Pruning
Task-oriented Segmentation
Vision-Language Guided Patch Pruning
Preliminary: Segmentation Model
Token Pruning Mechanism
Prune Decoder Design and Training
Pruned Tokens Reactivation
Experiments
Dataset and Metrics
Vision Language Fine-tuning
VLTP Framework Setup
...and 3 more sections

Figures (6)

Figure 1: (a) Semantic segmentation example. (b) TOS example. (c) For the same image, the segmentation mask and corresponding image patches change when the input task changes.
Figure 2: Multi-modal LLM (MLLM) guided segmentation model for TOS.
Figure 3: Illustration of the vision-language guided token pruning (VLTP) framework. (a) ViT architecture with prune decoder. (b) Prune decoder model architecture. In this illustration, we consider an image with only 9 patch tokens.
Figure 4: Visualization of VLTP image patch pruning for SAM ViT-H at layers 16 and 24. Three distinct pruning rates (0.5, 0.7, and 0.8) are illustrated alongside the ground truth (GT) task-related image patches.
Figure 5: Visualization of VLTP image patch pruning for SAM ViT-H at layers 8, 16, and 24 along with the ground truth (GT). The pruning rate is 0.7.
...and 1 more figures