Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji

TL;DR

This work tackles the high inference cost of multimodal large language models by identifying that vision tokens lose importance in deep layers due to attention sink and information migration. It introduces Visual Tokens Withdrawal (VTW), a plug-and-play method that withdraws vision tokens after a chosen layer, with the withdrawal layer selected via a KL-divergence criterion on a small subset. Empirical results across VQA, visual reasoning, video understanding, and downstream segmentation demonstrate that VTW reduces FLOPs by over 40% while maintaining performance and compatibility with existing KV Cache and Flash-attention mechanisms. The approach enables faster, scalable multimodal inference suitable for real-time chatbots and broad multimodal tasks, with clear ablations and future directions for training-time integration and cross-domain extension.
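The "over 40%" figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below is not the paper's measurement. It assumes a common per-layer approximation of 4nd² + 2n²d + 2ndm FLOPs for n tokens, hidden size d, and feed-forward size m, and it counts only a single forward pass over the prompt. The configuration (576 vision tokens, 64 text tokens, 32 layers, withdrawal after 16 layers, d = 4096, m = 11008) is an illustrative assumption resembling a 7B LLaVA-style model, and the function names are hypothetical.

```python
def layer_flops(n, d, m):
    """Approximate FLOPs of one decoder layer on n tokens.

    Assumed estimate: 4*n*d^2 (Q/K/V/O projections) + 2*n^2*d (attention)
    + 2*n*d*m (feed-forward). Constants vary between architectures.
    """
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m


def vtw_flops_reduction(n_vision, n_text, num_layers, k, d, m):
    """Fraction of FLOPs saved when vision tokens are dropped after k layers."""
    n_all = n_vision + n_text
    full = num_layers * layer_flops(n_all, d, m)
    # The first k layers see every token; the remaining layers see text only.
    vtw = k * layer_flops(n_all, d, m) + (num_layers - k) * layer_flops(n_text, d, m)
    return 1.0 - vtw / full


# Illustrative (assumed) configuration, not taken from the paper's tables.
reduction = vtw_flops_reduction(
    n_vision=576, n_text=64, num_layers=32, k=16, d=4096, m=11008
)
print(f"estimated FLOPs reduction: {reduction:.1%}")
```

Under these assumptions the estimate comes out at roughly 45%, which is in line with the claimed savings. The result is sensitive to the ratio of vision to text tokens and to the chosen withdrawal layer.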

Abstract

Multimodal large language models (MLLMs) demand considerable computation for inference due to their extensive parameters and the additional input tokens needed to represent visual information. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention sink phenomenon that is prevalent in LLMs also persists in MLLMs, suggesting that initial tokens and nearest tokens receive the majority of attention, while middle vision tokens garner minimal attention in deep layers; (2) the presence of information migration, which implies that visual information is transferred to subsequent text tokens within the first few layers of MLLMs. Based on these findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. Thus, we strategically withdraw them at a certain layer, enabling only text tokens to engage in subsequent layers. To pinpoint the ideal layer for VTW, we first analyze a limited set of tiny datasets and choose the first layer that meets the Kullback-Leibler divergence criterion. Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
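The withdrawal and layer-selection steps described in the abstract can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the "layers" are toy causal self-attention blocks in NumPy, the criterion is assumed to be that the mean KL divergence between the next-token distribution of the full model and that of the model with vision tokens withdrawn falls below a threshold, and the names `toy_layer`, `forward`, `select_withdrawal_layer`, and `eta` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions (clipped for stability)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))


def toy_layer(h, w):
    """Toy causal self-attention with a residual; stands in for a decoder layer."""
    t, d = h.shape
    scores = h @ h.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)  # a token cannot attend to later tokens
    return h + softmax(scores) @ h @ w


def forward(h, is_vision, weights, head, withdraw_at=None):
    """Return the next-token distribution read from the last (text) token.

    If withdraw_at = K, vision-token rows are dropped before layer K, so
    layers K, K+1, ... process text tokens only.
    """
    for k, w in enumerate(weights):
        if k == withdraw_at:
            h = h[~is_vision]
        h = toy_layer(h, w)
    return softmax(h[-1] @ head)


def select_withdrawal_layer(samples, weights, head, eta):
    """First layer K whose mean KL to the full model is below eta (assumed criterion)."""
    for k in range(1, len(weights)):
        kls = [
            kl_divergence(forward(h, v, weights, head),
                          forward(h, v, weights, head, withdraw_at=k))
            for h, v in samples
        ]
        if np.mean(kls) < eta:
            return k
    return len(weights)  # criterion never met: keep vision tokens in every layer


# Tiny synthetic calibration set: 2 system, 10 vision, 4 instruction tokens.
rng = np.random.default_rng(0)
dim, vocab, n_layers = 8, 16, 6
weights = [0.1 * rng.standard_normal((dim, dim)) for _ in range(n_layers)]
head = rng.standard_normal((dim, vocab))
is_vision = np.array([False] * 2 + [True] * 10 + [False] * 4)
samples = [(rng.standard_normal((is_vision.size, dim)), is_vision) for _ in range(5)]

k = select_withdrawal_layer(samples, weights, head, eta=0.05)
print("withdraw vision tokens before layer", k)
```

The sketch ignores positional encodings and the KV cache, both of which a real integration would need to handle when tokens are removed mid-stack. The threshold `eta` trades accuracy for speed: a smaller value pushes the withdrawal layer deeper and saves less computation.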

Paper Structure (32 sections, 6 equations, 12 figures, 9 tables, 1 algorithm)

Figures (12)

  • Figure 1: (a) An instance from POPE [pope2023kv-cache] with red box indicating key area for answering the question. (b) In FastV, some genuinely important tokens like "bottle" are pruned, while unimportant tokens like "cake" are preserved.
  • Figure 2: The framework of our method. Vision tokens are withdrawn in the $K$-th layer of large language models.
  • Figure 3: The illustration for the input of a multimodal large language model. The input tokens are composed of system tokens, vision tokens, instruction tokens, and output tokens.
  • Figure 4: The output token's attention towards various input token types across different layers on a combined subset of AI2D [kembhavi2016ai2d], MMMU_Val [yue2023mmmu], MME [fu2023mme], and POPE [pope2023kv-cache] (100 samples from each dataset). The attention values are averaged across all attention heads and output tokens.
  • Figure 5: The output token's attention towards various input token types across output tokens. Our visualization is conducted on a subset of AI2D [kembhavi2016ai2d], MMMU_Val [yue2023mmmu], MME [fu2023mme], and POPE [pope2023kv-cache] (20 samples from each dataset). The attention is averaged across all attention heads and layers.
  • ...and 7 more figures