GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs

Yuxiang Duan, Ao Li, Yingqin Li, Luyu Li, Pengwei Wang

TL;DR

The paper tackles the high computational cost of visual tokens in multimodal LLMs by proposing GridPrune, a training-free, two-stage pruning method. First, text-guided allocation distributes a token budget across spatial zones ("where to look"). Second, intra-zone selection ranks tokens ("what to select") by a fused score that combines Text-Conditional Relevance $r_i$ and Intrinsic Visual Saliency $a_i$: $s_i = (1 - \alpha)\hat{r}_i + \alpha a_i$, with $\hat{r}_i = (r_i + 1)/2$. Zone budgets are determined via a softmax over zone relevance, $P_j = \frac{\exp(\bar{r}_j)}{\sum_m \exp(\bar{r}_m)}$, and rounded so that $\sum_j k_j = k$, after which the top-$k_j$ tokens within each zone are selected. Empirically, GridPrune stays close to full performance with a fraction of the tokens (e.g., retaining around 97% or more of full performance with 11.1% of the tokens on several benchmarks) and reduces TFLOPs and latency on LLaVA-1.5-7B, LLaVA-NeXT-7B, and Qwen2.5-VL-7B, supporting the practical value of modeling the "where to look" stage in visual token pruning.
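The two stages summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default `alpha`, the use of the zone mean of $r_i$ as $\bar{r}_j$, the largest-remainder rounding rule, and the expectation that $a_i$ is already scaled to $[0, 1]$ are all choices made here for illustration. The sketch also assumes every zone budget fits within its zone and raises an error otherwise.

```python
import numpy as np


def allocate_zone_budgets(zone_relevance, k):
    """Split a total budget k across zones via softmax(zone_relevance)."""
    z = np.asarray(zone_relevance, dtype=float)
    p = np.exp(z - z.max())          # subtract max for numerical stability
    p /= p.sum()                     # P_j = exp(r_bar_j) / sum_m exp(r_bar_m)
    ideal = p * k
    budgets = np.floor(ideal).astype(int)
    # Assumption: largest-remainder rounding, so that sum_j k_j == k exactly.
    leftover = k - budgets.sum()
    order = np.argsort(-(ideal - budgets), kind="stable")
    budgets[order[:leftover]] += 1
    return budgets


def gridprune_select(r, a, zone_ids, k, alpha=0.5):
    """Return sorted indices of kept tokens and the per-zone budgets.

    r        : cosine-style relevance per token, in [-1, 1]
    a        : saliency per token, assumed to be scaled to [0, 1]
    zone_ids : integer zone label per token
    alpha    : fusion weight (hypothetical default)
    """
    r = np.asarray(r, dtype=float)
    a = np.asarray(a, dtype=float)
    zone_ids = np.asarray(zone_ids)

    r_hat = (r + 1.0) / 2.0                   # map [-1, 1] to [0, 1]
    s = (1.0 - alpha) * r_hat + alpha * a     # fused importance score

    zones = np.unique(zone_ids)
    # Assumption: zone relevance r_bar_j is the mean of r_i inside zone j.
    zone_rel = np.array([r[zone_ids == z].mean() for z in zones])
    budgets = allocate_zone_budgets(zone_rel, k)

    keep = []
    for z, k_j in zip(zones, budgets):
        idx = np.flatnonzero(zone_ids == z)
        if k_j > idx.size:
            raise ValueError("zone budget exceeds zone size")
        # Local top-k_j inside the zone, ranked by the fused score.
        keep.extend(idx[np.argsort(-s[idx], kind="stable")[:k_j]].tolist())
    return np.sort(np.array(keep, dtype=int)), budgets
```

Calling `gridprune_select(r, a, zone_ids, k=64)` returns the kept token indices in their original order, so they can be used to gather rows from the visual token sequence. Because each zone has its own budget, a zone with low relevance can still contribute tokens, which a single global Top-K does not guarantee.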

Abstract

Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 7 tables, 2 algorithms.

Figures (5)

  • Figure 1: Performance comparison of GridPrune against state-of-the-art methods across various MLLM architectures. (a) presents results on the high-resolution LLaVA-NeXT-7B, with 11.1% of visual tokens retained. (b) shows the average performance trend on Qwen2.5-VL-7B as the token retention ratio varies.
  • Figure 2: Comparison of GridPrune with FastV. (a) In a direct comparison, FastV’s selection is guided by a positional bias towards final tokens, while GridPrune’s is guided by the query’s semantic content. (b) Statistical analysis on the MME benchmark at scale shows that the histogram of selected indices exhibits a massive spike for FastV at the end of the sequence, revealing a strong positional bias. This is inefficient, as important content in images is typically centered or evenly distributed, rather than confined to one corner. In contrast, GridPrune's distribution is more balanced.
  • Figure 3: An overview of the GridPrune method. We first calculate two scores for each visual token: (a) Text-Conditional Relevance, derived from the cosine similarity between token embeddings and the text embedding (obtained from the CLIP text encoder using the user's prompt as input), and (b) Intrinsic Visual Saliency, extracted from the vision encoder's attention matrix. These are combined into (c) Fused Importance Score via $\alpha$. GridPrune follows a "guide-globally, select-locally" process: (1) the tokens are partitioned into zones; (2) a token budget is dynamically allocated to these zones based on their aggregate text-conditional relevance; and (3) a local Top-K selection is performed within each zone using the fused importance score to select the final token set. This mechanism ensures a selection that is both query-aware and spatially balanced.
  • Figure 4: The processing flow of GridPrune applied to LLaVA-NeXT. The image is first dynamically cropped into multiple sub-images, and each sub-image independently passes through the vision encoder. Then, guided by the instruction, GridPrune prunes the tokens of each sub-image separately. Finally, all the retained tokens are projected and concatenated to form the final sequence.
  • Figure 5: Visualization of GridPrune's token selection. The token focus dynamically shifts to the calendar, cup, or bird based on the user's query, even when the number of retained tokens is limited to 12.
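The inputs described in the Figure 3 caption, per-token Text-Conditional Relevance and the partition of tokens into zones, can be sketched as follows. This is an illustrative sketch only: both function names are hypothetical, the embeddings are assumed to already live in a shared space, and the rectangular, row-major zone layout is an assumption, since the captions state only that tokens are partitioned into zones.

```python
import numpy as np


def text_conditional_relevance(token_emb, text_emb):
    """Cosine similarity between each visual token and one text embedding.

    token_emb : array of shape (num_tokens, dim)
    text_emb  : array of shape (dim,)
    """
    token_emb = np.asarray(token_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    q = text_emb / np.linalg.norm(text_emb)
    return t @ q                      # values lie in [-1, 1]


def grid_zone_ids(grid_h, grid_w, zones_h, zones_w):
    """Label each token of a row-major grid_h x grid_w token grid with a zone.

    Assumption: zones form a zones_h x zones_w rectangular layout.
    """
    rows = np.arange(grid_h) * zones_h // grid_h   # zone row per token row
    cols = np.arange(grid_w) * zones_w // grid_w   # zone column per token column
    return (rows[:, None] * zones_w + cols[None, :]).reshape(-1)
```

For a model such as LLaVA-NeXT (Figure 4), the same two functions would be applied to each sub-image independently, since each sub-image passes through the vision encoder and is pruned separately.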