Beyond Intermediate States: Explaining Visual Redundancy through Language

Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen

TL;DR

The paper addresses the inefficiency of visual tokens in multi-modal large language models by moving beyond intermediate-state pruning to an input-output perspective. It introduces token-centric and context-centric analyses to quantify how each visual token contributes to final predictions, revealing that tokens with low ViT-[cls] similarity or low text-to-image attention can nonetheless carry meaningful information and influence surrounding context. Building on these insights, it proposes a training-free identify-then-probe strategy that constructs a redundancy codebook from training data and prunes tokens at inference by measuring similarity to redundant prototypes, achieving 90%–110% of peak performance while pruning 80%–90% of tokens across single-image, multi-image, and video tasks. The method consistently outperforms state-of-the-art intermediate-state–based pruning approaches and generalizes to diverse vision-language tasks, offering substantial efficiency gains without retraining. This work provides a practical, interpretable framework for reducing visual redundancy in MLLMs and highlights the nuanced role of individual visual tokens in visual understanding.

Abstract

Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token pruning methods based on MLLMs' intermediate states (e.g., attention scores). However, these methods cannot precisely define visual redundancy, because intermediate states do not capture the influence of visual tokens on MLLMs' visual understanding (i.e., the predicted probabilities for textual token candidates). To address this issue, we manipulate the visual input and investigate variations in the textual output from both token-centric and context-centric perspectives, achieving intuitive and comprehensive analysis. Experimental results reveal that visual tokens with low ViT-[cls] association and low text-to-image attention scores can contain recognizable information and significantly contribute to images' overall information. To develop a more reliable method for identifying and pruning redundant visual tokens, we integrate these two perspectives and introduce a context-independent condition to identify redundant prototypes from training images; these prototypes are then used to probe the redundancy of each visual token during inference. Extensive experiments on single-image, multi-image and video comprehension tasks demonstrate the effectiveness of our method, notably achieving 90% to 110% of the performance while pruning 80% to 90% of visual tokens.
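
The context-centric analysis compares the model's predicted distribution before and after a visual token is ablated, and the figure captions report this change as a Jensen-Shannon divergence (JSD). The sketch below illustrates that measurement in plain Python. It is not the authors' code: `toy_predict` is a hypothetical stand-in for an MLLM forward pass, and `leave_one_out_jsd` and its arguments are names assumed for illustration only.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q); terms with p_i == 0 contribute 0 by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (natural log), bounded by ln 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def leave_one_out_jsd(predict_logits, tokens, index):
    """JSD between output distributions with and without tokens[index]."""
    full = softmax(predict_logits(tokens))
    ablated = softmax(predict_logits(tokens[:index] + tokens[index + 1:]))
    return jsd(full, ablated)

def toy_predict(tokens):
    """Hypothetical model: logits are the element-wise mean of the tokens."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

# Three identical background tokens plus one distinctive token.
tokens = [[0.0, 0.0, 0.0]] * 3 + [[5.0, 0.0, 0.0]]
print(leave_one_out_jsd(toy_predict, tokens, 0))  # ablating a background token
print(leave_one_out_jsd(toy_predict, tokens, 3))  # ablating the distinctive token
```

In a real setting, `predict_logits` would wrap the MLLM and return next-token logits over the text vocabulary. A larger JSD then indicates that the ablated token has more influence on the prediction.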

Paper Structure

This paper contains 39 sections, 13 equations, 14 figures, 3 tables, 2 algorithms.

Figures (14)

  • Figure 1: We investigate the inherent information encoded in individual visual tokens by instructing LLaVA-Next to describe them and analyzing the corresponding decoding results, predicted probabilities, and confidence scores (logits). "Patch #" indicates the index in the flattened patch sequence. Some visual tokens with low ViT-[cls] similarity and low attention scores (e.g., Patch #114, #160, and #425) contain valid visual information (e.g., Carrot, Potato, and Spoon) that the model recognizes with high confidence (40% to 80% probability). Conversely, despite having high ViT-[cls] similarity and high attention scores (highlighted in the red box), certain visual tokens yield text descriptions unrelated to the image patches (e.g., Cat and Tree), with model confidence lower than 10%.
  • Figure 2: Overview of our proposed visual redundancy analysis pipeline. In the single visual token input experiment, we provide a single visual token to the LLM and instruct it to describe the visual content. By analyzing the predicted probabilities, we assess the significance of the information encoded in each visual token. Next, we examine the influence of individual visual tokens on the broader visual context (image or image region) by measuring changes in the predicted probability distribution before and after ablating specific visual tokens. The region-level leave-one-out experiment evaluates the influence of a single visual token (highlighted in red) on its neighboring image region, while the global-level leave-one-out experiment assesses the impact of this region on the entire image. The results from these two experiments are combined to quantify the influence of individual visual tokens on the overall image representation.
  • Figure 3: Visual tokens with low ViT-[cls] similarity and text-to-image attention scores can more significantly impact LLaVA-Next's understanding of the image, as patch #510 has higher JSD values than patch #523. "candi." and "diff." denote candidates and differences, respectively. Patch #510 primarily contributes the semantic information Soup to its neighboring region (+3.4844 confidence scores) and to the entire image (+0.2969 scores).
  • Figure 4: Quantitative results on 6,400 image patches sampled from the VQAv2 validation set. As the text-to-image attention score and the ViT-[cls] similarity decrease, the top-1 probability and the Jensen-Shannon Divergence do not show a declining trend; instead, they fluctuate around 0.24 and 4e-3, respectively. The results are averaged across 100 image samples.
  • Figure 5: An overview of our identify-then-probe approach. We identify redundant prototypes from training images using single-input and cascaded leave-one-out experiments, and store them in an extensible codebook. During inference, visual tokens with higher similarity to these prototypes are considered more likely to be redundant and are removed before the first layer of the LLM. $L$ and $R$ are the number of input and retained visual tokens, respectively.
  • ...and 9 more figures
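
The probing step described in the Figure 5 caption (score each of the $L$ input tokens by its similarity to the redundant prototypes, then retain $R$ of them) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not specify the similarity measure or how scores are aggregated over prototypes, so cosine similarity and a maximum over a non-empty codebook are assumptions here, and the names `cosine`, `redundancy_scores`, and `prune` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def redundancy_scores(tokens, codebook):
    """Score each token by its highest similarity to any prototype.

    Assumption: max-over-prototypes cosine similarity; codebook is non-empty.
    """
    return [max(cosine(t, p) for p in codebook) for t in tokens]

def prune(tokens, codebook, keep):
    """Return the indices of the `keep` least redundant tokens (R of L).

    Indices are returned in their original order so that the retained
    tokens preserve their positions in the flattened patch sequence.
    """
    scores = redundancy_scores(tokens, codebook)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i])
    return sorted(ranked[:keep])

# Toy example: one prototype, four 2-D "visual tokens", retain R = 2 of L = 4.
codebook = [[1.0, 0.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
kept = prune(tokens, codebook, keep=2)
print(kept)  # indices of the two tokens least similar to the prototype
```

Because the scores depend only on the token embeddings and the stored prototypes, the selection can run before the first LLM layer, which matches the caption's description of where tokens are removed.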