TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models

Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen

TL;DR

The paper addresses the computational cost of visual tokens in Multimodal LLMs (MLLMs) by introducing TokenCarve, a training-free, two-stage token compression method that preserves information in the attention output matrix. The authors link model performance to the quantity of information, measured by the rank of the attention output matrix, which motivates a two-stage process of pruning and then merging, both guided by Information-Preservation-Guided Selection (IPGS). Across 11 datasets and two model variants, TokenCarve reduces visual tokens to 22.2% of the original count with only a 1.54% drop in accuracy, while delivering a 1.23× inference speedup and a 64% reduction in KV cache storage. Performance holds up well on OCR-heavy tasks, and the method offers a practical plug-and-play solution for MLLMs that requires no retraining.

Abstract

Multimodal Large Language Models (MLLMs) are becoming increasingly popular, but the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when aggressively reducing token counts. In this study, we reveal that the performance degradation of MLLMs closely correlates with the accelerated loss of information in the attention output matrix. This insight introduces a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23× speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy. Our code is available at https://github.com/ShawnTan86/TokenCarve.
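
The information measure mentioned above, the rank of the attention output matrix, can be illustrated with a small NumPy sketch. This is not the authors' code: the single-head attention, the random weights, the toy hidden size `d_model = 64`, and the function names `attention_output` and `information_quantity` are all assumptions made for illustration. Only the 576-token count comes from the source. The "keep the first k tokens" rule is a naive stand-in and is not the paper's selection strategy.

```python
import numpy as np


def attention_output(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention output for a token matrix x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v


def information_quantity(x, w_q, w_k, w_v):
    """Proxy for information content: numerical rank of the attention output matrix."""
    return int(np.linalg.matrix_rank(attention_output(x, w_q, w_k, w_v)))


rng = np.random.default_rng(0)
n_tokens, d_model = 576, 64  # 576 visual tokens as in LLaVA-1.5; d_model is a toy value
x = rng.standard_normal((n_tokens, d_model))
w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

for keep in (576, 192, 128, 32):
    kept = x[:keep]  # naive truncation, used here only to vary the token count
    print(keep, information_quantity(kept, w_q, w_k, w_v))
```

The rank can never exceed the smaller of the token count and the hidden size, so once compression pushes the token count low enough the measurable information has to fall. The sketch only makes that bound concrete and does not reproduce the paper's measurements.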

Paper Structure

This paper contains 25 sections, 10 equations, 8 figures, 18 tables.

Figures (8)

  • Figure 1: (a) The radar chart illustrates the performance of TokenCarve on eight datasets when compressing the visual tokens of LLaVA1.5-7B from 576 to 192, showing that the overall performance remains close to the uncompressed version despite the significant token reduction; (b) Key performance indicators demonstrate that TokenCarve achieves a compression ratio of 77.8%, with an average performance drop of only 1.54%, a 1.23× inference speedup, and a 64% reduction in KV cache usage; (c) The visualization depicts token positions during the two-stage compression process, where gray tokens are pruned and pink tokens are merged. In this OCR task example, TokenCarve consistently focuses on the critical regions of the image containing the text "Hawaii" throughout all compression stages.
  • Figure 2: Key observation: Curves of MLLM Performance and Visual Token Information under Visual Token Compression. The blue curve represents the performance variation, while the orange solid curve depicts the change in visual token information (with the orange dashed line indicating the original trend). The purple vertical line marks the threshold at which performance exhibits a pronounced decline, coinciding with an inflection in the information curve.
  • Figure 3: The pipeline of the proposed TokenCarve framework. TokenCarve is integrated between the second and third layers of the LLaVA-1.5 model as a plug-and-play module and compresses visual tokens during the prefilling stage. The upper panel illustrates TokenCarve’s integration with LLaVA-1.5, whose input conventionally comprises 36 system tokens, 576 visual tokens, and a variable number of prompt tokens. The lower panel (the blue region on the left) details the two-stage compression process: Carving Stage I employs the IPGS module to remove tokens with low contribution; Carving Stage II implements finer-grained token merging based on GSM, maximizing information retention. The IPGS module (right region of the lower panel) calculates each token’s information contribution score and attention score, then combines the two into a final ranking score (a higher slash count indicates a higher information contribution, and darker tokens signify higher attention). The GSM module (middle region of the lower panel) uses these IPGS scores to split tokens into a higher-scored Set A and a lower-scored Set B, then merges Set B tokens with their most similar counterparts in Set A based on cosine similarity (with deeper connection lines representing higher similarity).
  • Figure 4: Impact of weighting coefficient ($\lambda$) on model performance. The results show that extreme values of $\lambda$ at both ends lead to performance degradation, highlighting the importance of the combined score.
  • Figure 5: Impact of merge proportion ($\rho$) on model performance. The results indicate that TokenCarve remains robust across different values of $\rho$, with performance variations staying within 1.5%.
  • ...and 3 more figures
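
The two-stage process described in the Figure 3 caption (score, prune, split into Set A and Set B, merge by cosine similarity) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The captions do not say how the information contribution score is computed, so it is taken as a given input. The convex combination weighted by $\lambda$, the reading of $\rho$ as the fraction of removed tokens that are merged rather than pruned, and merging by simple averaging are all assumptions. The names `combined_score` and `carve` are hypothetical.

```python
import numpy as np


def combined_score(info, attn, lam=0.5):
    """Assumed ranking score: convex combination of min-max normalised scores."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)

    return lam * norm(info) + (1.0 - lam) * norm(attn)


def carve(tokens, info, attn, n_final, rho=0.5, lam=0.5):
    """Two-stage sketch: prune by score, then merge lower-scored survivors.

    tokens: (n, d) array of visual token features.
    info, attn: length-n per-token scores (inputs here, not computed).
    Returns an (n_final, d) array.
    """
    n = tokens.shape[0]
    order = np.argsort(-combined_score(info, attn, lam))  # highest score first
    n_merge = int(round(rho * (n - n_final)))  # assumed meaning of rho

    # Stage I: drop the lowest-scoring tokens, keeping n_final + n_merge.
    survivors = order[: n_final + n_merge]

    # Stage II: Set A is the higher-scored part, Set B the lower-scored part.
    set_a, set_b = survivors[:n_final], survivors[n_final:]
    a = tokens[set_a].astype(float)
    if len(set_b) == 0:
        return a
    b = tokens[set_b].astype(float)

    # Cosine similarity between every Set B token and every Set A token.
    eps = 1e-12
    a_unit = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    nearest = (b_unit @ a_unit.T).argmax(axis=1)

    # Merge each Set B token into its most similar Set A token by averaging.
    merged = a.copy()
    counts = np.ones(n_final)
    np.add.at(merged, nearest, b)
    np.add.at(counts, nearest, 1)
    return merged / counts[:, None]


rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 16))  # 576 visual tokens; feature size is a toy value
info, attn = rng.random(576), rng.random(576)  # placeholder scores
compressed = carve(tokens, info, attn, n_final=128, rho=0.5, lam=0.5)
print(compressed.shape)
```

Setting `rho=0.0` in this sketch reduces it to pure pruning, and `lam` at 0 or 1 ranks tokens by only one of the two scores, which are the extremes that the Figure 4 caption reports as degrading performance.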