[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

TL;DR

This work tackles the efficiency challenge of Multimodal Large Language Models by introducing a training-free visual token pruning method, VTC-CLS. It identifies a perception bias in prior pruning approaches that rely on visual–prompt attention and instead leverages the CLS token’s attention in the visual encoder, aggregated across $K$ layers to preserve the top $U$ tokens. The approach demonstrates state-of-the-art performance and significant inference speedups across eight benchmarks, while maintaining most of the original model’s capabilities. Its plug-and-play, training-free nature offers practical benefits for deploying MLLMs in resource-constrained settings.

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision-language tasks, garnering significant attention in the computer vision community. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. Recognizing the redundancy of information within the vision modality, recent studies have explored methods for compressing visual tokens in MLLMs to enhance efficiency in a training-free manner. Despite their effectiveness, existing methods like FastV rely on the attention between visual tokens and prompt text tokens as the importance indicator, overlooking the relevance to the response text and thus introducing perception bias. In this paper, we demonstrate that the [CLS] token in the visual encoder of an MLLM inherently knows which visual tokens are important to the MLLM. Building on this prior, we introduce a simple yet effective method for training-free visual token compression, called VTC-CLS. First, it uses the attention score of the [CLS] token on visual tokens as an importance indicator for pruning visual tokens. Second, it ensembles the importance scores derived by the [CLS] token from different layers to capture the key visual information more comprehensively. Extensive experiments demonstrate that VTC-CLS achieves state-of-the-art performance across various tasks compared with baseline methods. It also incurs notably lower computational costs in a training-free manner, highlighting its effectiveness and superiority. Code and models are available at https://github.com/THU-MIG/VTC-CLS.
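
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes the per-layer [CLS]-to-patch attention has already been extracted and averaged over heads, and that the ensemble over the $K$ layers is a plain mean; both are assumptions made for illustration. The function names `cls_importance`, `select_top_tokens`, and `prune_visual_tokens` are hypothetical, and pure Python lists stand in for tensors.

```python
from typing import List, Sequence, Tuple


def cls_importance(cls_attn_per_layer: Sequence[Sequence[float]]) -> List[float]:
    """Ensemble [CLS]-to-patch attention over K encoder layers.

    cls_attn_per_layer[k][i] is the attention the [CLS] token pays to
    visual token i in the k-th chosen layer (already head-averaged).
    Taking the mean over layers is an assumed ensembling rule.
    """
    num_layers = len(cls_attn_per_layer)
    num_tokens = len(cls_attn_per_layer[0])
    return [
        sum(layer[i] for layer in cls_attn_per_layer) / num_layers
        for i in range(num_tokens)
    ]


def select_top_tokens(scores: Sequence[float], num_keep: int) -> List[int]:
    """Indices of the num_keep highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


def prune_visual_tokens(
    tokens: Sequence[str],
    cls_attn_per_layer: Sequence[Sequence[float]],
    num_keep: int,
) -> Tuple[List[str], List[int]]:
    """Keep the top-U visual tokens (U = num_keep) ranked by [CLS] attention."""
    scores = cls_importance(cls_attn_per_layer)
    keep = select_top_tokens(scores, num_keep)
    return [tokens[i] for i in keep], keep


# Toy example: 4 visual tokens, K = 2 layers, keep U = 2 tokens.
attn = [
    [0.1, 0.4, 0.2, 0.3],
    [0.3, 0.3, 0.1, 0.3],
]
kept_tokens, kept_indices = prune_visual_tokens(["a", "b", "c", "d"], attn, 2)
print(kept_tokens, kept_indices)
```

Because the ranking depends only on the visual encoder, the kept indices can be computed before the LLM sees any prompt text, which is consistent with the training-free, plug-and-play framing above.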

Paper Structure

This paper contains 12 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Top: the framework of FastV. It uses the attention in the LLM to perform token pruning and suffers from a perception bias: tokens that are relevant to the response text are inadvertently eliminated. Bottom: the framework of our VTC-CLS. We use the [CLS] token to identify salient visual tokens, maintaining comprehensive visual information for response generation. Red lines indicate strong correlations. Orange and blue rectangles denote visual tokens and text tokens, respectively.
  • Figure 2: (a) Overlap proportion of important visual tokens in the visual encoder and the LLM's layers. For inspection, we use the attention scores of the [CLS] token and random values as the token importance score in the visual encoder, denoted as "[CLS]" and "Random", respectively. With 128 and 64 visual tokens kept, "[CLS]" consistently shows a high overlap ratio, indicating high consistency with the importance score in the LLM. (b) Spearman's rank correlation coefficient between importance scores in the visual encoder and the LLM's layers. (c) Average overlap ratio of important tokens and Spearman's rank correlation coefficient of importance scores in the visual encoder and the LLM's layers under different $K$. Ensembling the importance scores across layers in the visual encoder strengthens their consistency with those of the LLM. The best $K$ is 3.
  • Figure 3: The pipeline of our method. Motivated by the high consistency between the important visual tokens in the visual encoder and those in the LLM of MLLMs, we use the attention scores of the [CLS] token in the visual encoder as the importance indicator. We ensemble the importance scores across different layers of the visual encoder for joint selection and retain the critical tokens needed by the LLM.
  • Figure 4: Visualization of retained visual patches. The areas masked in black represent the discarded visual tokens. We also show the correspondence between salient visual regions and text tokens using different colors. Our method effectively removes redundant visual signals while preserving the significant ones, enabling various textual tokens to perceive the corresponding visual content in the LLM. Best viewed in color.
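
The two consistency measures named in the Figure 2 caption, the overlap ratio of the top-ranked tokens and Spearman's rank correlation coefficient, can be sketched as follows. This is an illustrative sketch only: how the paper derives the LLM-side importance scores is not stated in this excerpt, so both score vectors are treated as given inputs, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def top_k_overlap(scores_a: Sequence[float], scores_b: Sequence[float], k: int) -> float:
    """Fraction of the top-k tokens under scores_a that are also top-k under scores_b."""
    def top(scores: Sequence[float]) -> set:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])
    return len(top(scores_a) & top(scores_b)) / k


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


# Toy scores for 4 visual tokens (made-up numbers, not results from the paper).
encoder_scores = [0.9, 0.1, 0.8, 0.2]
llm_scores = [0.7, 0.6, 0.1, 0.2]
print(top_k_overlap(encoder_scores, llm_scores, 2))
print(spearman(encoder_scores, llm_scores))
```

A high overlap ratio means the encoder-side ranking keeps mostly the same tokens the LLM attends to, while Spearman's coefficient compares the full orderings rather than only the top of the list.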