SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs

Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He

TL;DR

SCOPE tackles the inefficiency of multimodal LLMs caused by abundant visual tokens by jointly optimizing token saliency and semantic coverage. It defines a set-coverage objective and a token-coverage gain, then combines the gain with saliency into a SCOPE score that is used to greedily select tokens, preserving semantic richness while reducing compute. Empirical results on LLaVA-1.5 and LLaVA-Next show large token reductions with minimal loss in, or even improved, task performance across image and video benchmarks, all in a training-free setup. The approach offers practical speedups for vision-language models with little accuracy cost and is applicable to edge and real-time settings.

Abstract

Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at https://github.com/kinredon/SCOPE.
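
The greedy selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the use of cosine similarity as the "token relationship", the facility-location style coverage (each token is covered by its most similar selected token), and the exponent `alpha` that weights saliency are all assumptions made for illustration. See the linked repository for the actual method.

```python
import numpy as np


def greedy_saliency_coverage_select(features, saliency, num_keep, alpha=1.0):
    """Greedily pick `num_keep` token indices by (coverage gain) * saliency**alpha.

    Hypothetical sketch; the similarity measure, coverage definition, and the
    `alpha` weighting are illustrative assumptions, not the paper's formulas.
    """
    feats = np.asarray(features, dtype=np.float64)
    sal = np.asarray(saliency, dtype=np.float64)  # assumed non-negative
    n = feats.shape[0]
    if not 0 < num_keep <= n:
        raise ValueError("num_keep must be in the range 1..number of tokens")

    # Assumed token relationship: cosine similarity, with negatives clipped to 0.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    sim = np.clip(unit @ unit.T, 0.0, None)

    covered = np.zeros(n)               # how well each token is covered so far
    available = np.ones(n, dtype=bool)  # tokens not yet selected
    selected = []
    for _ in range(num_keep):
        # Coverage gain of candidate i: total increase in per-token coverage
        # if i were added to the selected set.
        gain = np.clip(sim - covered[None, :], 0.0, None).sum(axis=1)
        score = gain * sal ** alpha
        score[~available] = -np.inf
        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        covered = np.maximum(covered, sim[best])
    return selected


# Tokens 0 and 1 point in the same direction (redundant); token 2 is distinct.
toy_features = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
toy_saliency = [0.5, 0.4, 0.1]
print(greedy_saliency_coverage_select(toy_features, toy_saliency, num_keep=2))  # [0, 2]
```

In the toy example, ranking by saliency alone would keep tokens 0 and 1, which carry the same information. Once token 0 is selected, the coverage gain of token 1 drops to zero, so the low-saliency but distinct token 2 is chosen instead.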

Paper Structure

This paper contains 22 sections, 10 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: (a) Semantic Completeness Analysis. We visualize the selected tokens using a saliency-based rule (Top) and our method (Bottom). The saliency score corresponds to the visual attention assigned to the CLS token. Our method selects tokens that maximize coverage while preserving the most dominant visual information. (b) Skewed Attention Distribution. We show the averaged attention distribution of the top 128 tokens on the MME benchmark. The attention weights rapidly flatten, making tail tokens less distinguishable based on their attention values. (c) Performance comparison with prior methods across various benchmarks. The model is LLaVA-1.5 7B, and the number of retained tokens is 64.
  • Figure 2: Comparison of $\theta$-coverage across different token pruning criteria. The experiments are conducted on the MME benchmark, with 64 tokens selected out of the original 576 in LLaVA-1.5 7B.
  • Figure 3: An overview of the proposed visual token pruning framework. The left part illustrates how our method reduces the number of visual tokens before feeding them into the LLM, thereby accelerating inference in MLLMs without requiring additional model training. The right part provides a detailed view of our SCOPE method, which jointly optimizes saliency and coverage to select a compact yet semantically representative subset of visual tokens.
  • Figure 4: Performance comparison with an extremely small number of retained tokens.
  • Figure 5:
  • ...and 3 more figures

Theorems & Definitions (1)

  • Definition 1: $\theta$-Coverage