Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models

Gaotong Yu, Yi Chen, Jian Xu

TL;DR

This work targets the high computational cost of multimodal LLMs by exploiting the long-tail distribution of CLS-to-visual token similarity to prune visual tokens dynamically, per sample, before they reach the LLM. It introduces a three-stage pipeline: (1) dynamic segmentation that preserves the head of the CLS–visual similarity distribution, (2) projection of the retained visual tokens and concatenation with the text tokens, and (3) cross-modal interactive pruning that further shortens the input inside the LLM according to visual–text relevance. Across multiple benchmarks, the method achieves up to 8× compression with minimal accuracy loss and, in training-free settings, uses on average 22% of the original tokens (with further gains under fine-tuning). By aligning token representations with cross-modal relevance and per-sample dynamics, the approach offers a scalable, hardware-friendly way to accelerate multimodal reasoning without a substantial loss in performance.

Abstract

Recently, multimodal large language models (MM-LLMs) have achieved significant success in various tasks, but their high computational costs limit widespread application. The main computational burden arises from processing concatenated text and visual tokens in the LLM layer, where input token length directly affects efficiency. Our analysis of visual tokens reveals that their similarity to the CLS token follows a long-tail distribution, with only a few tokens showing high similarity. To exploit this, we propose a dynamic pruning algorithm that identifies the inflection point in the CLS-to-visual token similarity curve, enabling effective trimming of visual tokens to accelerate inference. Additionally, we perform a second round of pruning in the LLM layer, filtering out low-correlation tokens through the interaction between visual and textual features. Experimental results demonstrate that our method achieves performance comparable to the original while using only 22% of the original number of tokens. Our source code will be made publicly available upon acceptance.
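
The first pruning stage described above can be illustrated with a minimal sketch. The abstract does not say how the inflection point is located, so the rule below is an assumption: a generic knee heuristic that picks the point of the sorted similarity curve lying furthest below the straight line joining its first and last points. The function names `head_size` and `prune_by_cls_similarity` are hypothetical and are not taken from the authors' code.

```python
from typing import List, Sequence


def head_size(sorted_desc: Sequence[float], tol: float = 1e-9) -> int:
    """Return how many top-ranked tokens form the 'head' of a descending curve.

    Assumed heuristic (not the paper's stated rule): scale both axes to
    [0, 1] and find the point furthest below the chord joining the first
    and last points. Tokens ranked before that point are the head.
    """
    n = len(sorted_desc)
    if n < 3:
        return n
    top, bottom = sorted_desc[0], sorted_desc[-1]
    span = top - bottom
    if span <= 0:  # flat curve: there is no tail to cut
        return n
    best_idx, best_gap = 0, 0.0
    for i, value in enumerate(sorted_desc):
        x = i / (n - 1)
        y = (value - bottom) / span
        gap = (1.0 - x) - y  # chord height minus curve height at x
        if gap > best_gap:
            best_idx, best_gap = i, gap
    if best_gap <= tol:  # curve never dips below the chord: keep everything
        return n
    return max(1, best_idx)


def prune_by_cls_similarity(similarities: Sequence[float]) -> List[int]:
    """Return the original indices of the visual tokens kept, in input order."""
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    ranked = [similarities[i] for i in order]
    keep = head_size(ranked)
    return sorted(order[:keep])


if __name__ == "__main__":
    # Toy CLS-to-visual similarities: three salient tokens, seven tail tokens.
    sims = [0.02, 0.9, 0.03, 0.8, 0.01, 0.7, 0.05, 0.04, 0.1, 0.02]
    print(prune_by_cls_similarity(sims))
```

Because the cut-off is derived from each sample's own curve, the number of retained tokens varies per image instead of being a fixed top-k, which is the per-sample behaviour the abstract describes. The second, cross-modal pruning round inside the LLM is not covered by this sketch.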

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Visual tokens are sparse. Visualization of CLS-to-visual token similarity: black areas represent the 'tails' and the remaining areas the 'heads'. The red boxes mark the key regions of the image.
  • Figure 2: Framework of the dynamic optimization algorithm for token compression targeting the long-tail distribution; the right side shows the overall framework and the left side the submodule diagram.
  • Figure 3: Example of converting the token pruning task into a long-tail distribution segmentation task.
  • Figure 4: Example of visual token pruning with our method.
  • Figure 5: Example of visual-text interactive pruning.
  • ...and 1 more figure