GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs

Yuxiang Duan; Ao Li; Yingqin Li; Luyu Li; Pengwei Wang

TL;DR

The paper tackles the high computational cost of visual tokens in multimodal LLMs by proposing GridPrune, a training-free, two-stage pruning method. In the first stage ("where to look"), text-guided allocation distributes a total token budget $k$ across spatial zones. Zone budgets come from a softmax over mean zone relevance, $P_j = \frac{\exp(\bar{r}_j)}{\sum_m \exp(\bar{r}_m)}$, and are rounded so that $\sum_j k_j = k$. In the second stage ("what to select"), the top-$k_j$ tokens within each zone are kept according to a fused score that combines Text-Conditional Relevance and Intrinsic Visual Saliency: $s_i = (1 - \alpha)\hat{r}_i + \alpha a_i$, where $\hat{r}_i = (r_i + 1)/2$. Empirically, GridPrune stays close to full-model performance with a fraction of the tokens (e.g., retaining around 97% of performance with 11.1% of the tokens across several benchmarks) and delivers substantial reductions in TFLOPs and latency on LLaVA-1.5-7B, LLaVA-NeXT-7B, and Qwen2.5-VL-7B, supporting the practical value of modeling the "where to look" stage in visual token pruning.
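The two stages summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `allocate_budgets` and `grid_prune_sketch`, the precomputed inputs `r` (relevance in $[-1, 1]$), `a` (saliency in $[0, 1]$) and `zone_ids`, the largest-remainder rounding rule, and the cap of each budget at its zone size are all assumptions made for illustration, since the summary only states that budgets are rounded so that $\sum_j k_j = k$.

```python
import numpy as np


def allocate_budgets(zone_relevance, zone_sizes, k):
    """Stage 1: softmax over mean zone relevance, rounded so budgets sum to k.

    The rounding rule (largest remainder, capped at zone size) is an assumption.
    """
    z = zone_relevance - zone_relevance.max()  # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()            # P_j
    ideal = p * k                              # fractional budgets
    budgets = np.minimum(np.floor(ideal).astype(int), zone_sizes)
    remainder = ideal - np.floor(ideal)
    # Hand out leftover tokens one at a time, largest remainder first,
    # skipping zones that are already full.
    while budgets.sum() < k:
        open_zones = budgets < zone_sizes
        if not open_zones.any():
            break  # k exceeds the total number of tokens
        j = int(np.argmax(np.where(open_zones, remainder, -np.inf)))
        budgets[j] += 1
        remainder[j] -= 1.0
    return budgets


def grid_prune_sketch(r, a, zone_ids, k, alpha=0.5):
    """Return sorted indices of kept tokens and the per-zone budgets k_j."""
    r = np.asarray(r, dtype=float)
    a = np.asarray(a, dtype=float)
    zone_ids = np.asarray(zone_ids, dtype=int)
    n_zones = int(zone_ids.max()) + 1

    zone_sizes = np.bincount(zone_ids, minlength=n_zones)
    zone_sums = np.bincount(zone_ids, weights=r, minlength=n_zones)
    zone_relevance = zone_sums / np.maximum(zone_sizes, 1)  # mean r per zone

    budgets = allocate_budgets(zone_relevance, zone_sizes, k)

    # Stage 2: fused score s_i = (1 - alpha) * (r_i + 1) / 2 + alpha * a_i
    scores = (1.0 - alpha) * (r + 1.0) / 2.0 + alpha * a

    keep = []
    for j in range(n_zones):
        idx = np.flatnonzero(zone_ids == j)
        order = np.argsort(-scores[idx])       # descending score within zone j
        keep.extend(idx[order[: budgets[j]]].tolist())
    return np.sort(np.array(keep, dtype=int)), budgets
```

For example, with 16 tokens arranged in four zones of four tokens each and `k=6`, the call `grid_prune_sketch(r, a, zone_ids, k=6)` returns six token indices, and zones with higher mean relevance receive larger budgets. In contrast to a single global Top-K, every zone's share is fixed before any individual token is compared, which is the "guide-globally, select-locally" idea in miniature.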

Abstract

Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
