EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

Yuhao Chen, Bin Shan, Xin Ye, Cheng Chen

TL;DR

EvoPrune is an early-stage visual token pruning method for MLLMs that prunes directly during visual encoding. On the VideoMME dataset, it achieves a 2$\times$ inference speedup with less than 1% performance degradation, demonstrating its potential for latency-sensitive MLLM deployment.

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance in vision-language tasks, but their inference efficiency is severely limited by the exponential growth of visual tokens in complex scenarios such as high-resolution images and videos. Existing visual token pruning methods mainly operate after visual encoding, overlooking the substantial computational cost incurred during the encoding stage. To address this issue, we propose EvoPrune, an early-stage visual token pruning method for MLLMs that performs pruning directly during visual encoding. Specifically, EvoPrune employs a layer-wise pruning strategy guided by token similarity, diversity, and attention-based importance to retain the most informative visual tokens at selected encoding layers. Extensive experiments on image and video benchmarks validate the effectiveness of EvoPrune. In particular, on the VideoMME dataset, EvoPrune achieves 2$\times$ inference speedup with less than 1% performance degradation, demonstrating its potential for latency-sensitive MLLM deployment.
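
The abstract names three signals (token similarity, diversity, and attention-based importance) but does not give the scoring formula here, so the following is only a minimal sketch of how one pruning step at a single encoder layer might look. Everything specific in it is an assumption made for illustration: the function names `composite_scores` and `merge_step`, the weights `w_imp` and `w_div`, the "uniqueness" proxy used for diversity, the greedy pair selection, and the averaging merge rule are not taken from the paper.

```python
import numpy as np

def composite_scores(tokens, attn, w_imp=0.5, w_div=0.5):
    """Pairwise merge scores for n >= 2 tokens; higher means 'safer to merge'.

    ASSUMPTION: the linear combination and its weights are illustrative only.
    """
    n = tokens.shape[0]
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                                   # cosine similarity (n, n)
    # Diversity proxy (assumed): a token unlike all others is "unique".
    uniq = 1.0 - (sim.sum(axis=1) - np.diag(sim)) / (n - 1)
    imp = attn / attn.sum()                               # normalised importance
    score = (sim
             - w_imp * (imp[:, None] + imp[None, :])      # protect important tokens
             - w_div * (uniq[:, None] + uniq[None, :]))   # protect unique tokens
    np.fill_diagonal(score, -np.inf)                      # no self-merging
    return score

def merge_step(tokens, attn, r):
    """Greedily merge r disjoint token pairs, returning n - r tokens.

    Kept tokens come first, merged tokens are appended at the end, so the
    original token order is NOT preserved (a simplification of this sketch).
    """
    n, d = tokens.shape
    if not 0 <= r <= n // 2:
        raise ValueError("r must satisfy 0 <= r <= n // 2")
    score = composite_scores(tokens, attn)
    rows, cols = np.triu_indices(n, k=1)                  # each unordered pair once
    order = np.argsort(-score[rows, cols])                # best pairs first
    used = np.zeros(n, dtype=bool)
    merged_tokens, merged_attn = [], []
    for k in order:
        if len(merged_tokens) == r:
            break
        i, j = rows[k], cols[k]
        if used[i] or used[j]:
            continue                                      # keep pairs disjoint
        used[i] = used[j] = True
        merged_tokens.append((tokens[i] + tokens[j]) / 2.0)  # assumed: plain average
        merged_attn.append(attn[i] + attn[j])                # assumed: importance adds up
    out_tokens = np.concatenate(
        [tokens[~used], np.array(merged_tokens).reshape(-1, d)])
    out_attn = np.concatenate([attn[~used], np.array(merged_attn)])
    return out_tokens, out_attn
```

Calling `merge_step(tokens, attn, r)` at each selected encoder layer would shrink the sequence by `r` tokens before the next layer runs, which is where an early-stage method saves encoder compute compared with pruning only after encoding.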

Paper Structure

This paper contains 44 sections, 13 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Inference time of different components in MLLMs with and without visual pruning under different input scales.
  • Figure 2: Overview of the EvoPrune framework. Token merging operations are applied at selected visual encoder layers. A composite score matrix integrating semantic similarity, diversity, and attention-based importance guides token pair selection.
  • Figure 3: Breakdown of Time-To-First-Token (TTFT, 1 unit = 10 ms). The Visual Encoder, Other, and LLM Backbone correspond to the visual encoder, intermediate processing modules (e.g., pooling), and the language model backbone, respectively.
  • Figure 4: Performance comparison of different layer-wise merging strategies across video benchmarks. Accuracy (top row) and encoding time (bottom row) are shown as functions of the retained token budget $B \in \{16, 32, 64\}$. Each curve represents a specific layer-wise allocation pattern. The window size is $N=13$ for First and Last (out of 26 total layers), and the growth/decay rate is $\alpha=1$ for Increasing and Decreasing.
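
The Figure 4 caption names four layer-wise allocation patterns (First, Last, Increasing, Decreasing) without defining them in this summary. The sketch below shows one plausible reading, under stated assumptions: First and Last spread merges uniformly over the first or last `window` layers, while Increasing and Decreasing weight layer `l` by a power law with exponent `alpha`. The function name `allocate_merges`, the power-law form, and the rounding rule are hypothetical and may differ from the paper's definitions. It also allocates a total number of merges, whereas the caption parameterises by the retained budget $B$.

```python
def allocate_merges(total, num_layers=26, pattern="first", window=13, alpha=1.0):
    """Split `total` token merges across encoder layers.

    Returns a list of per-layer merge counts that sums to `total`.
    ASSUMPTION: the weight shapes below are illustrative readings of the
    pattern names, not the paper's definitions.
    """
    layers = range(num_layers)
    if pattern == "first":
        weights = [1.0 if l < window else 0.0 for l in layers]
    elif pattern == "last":
        weights = [1.0 if l >= num_layers - window else 0.0 for l in layers]
    elif pattern == "increasing":
        weights = [float(l + 1) ** alpha for l in layers]
    elif pattern == "decreasing":
        weights = [float(num_layers - l) ** alpha for l in layers]
    else:
        raise ValueError(f"unknown pattern: {pattern}")

    norm = sum(weights)
    exact = [total * w / norm for w in weights]   # ideal fractional share
    counts = [int(x) for x in exact]              # round down first
    # Largest-remainder rounding: hand the leftover merges to the layers
    # whose fractional parts are largest, so the counts sum to `total`.
    leftover = total - sum(counts)
    by_fraction = sorted(layers, key=lambda l: exact[l] - counts[l], reverse=True)
    for l in by_fraction[:leftover]:
        counts[l] += 1
    return counts
```

Under this reading, `allocate_merges(100, pattern="first")` places all merges in layers 0 to 12, so later layers process a shorter sequence, while `pattern="last"` defers all merging to layers 13 to 25 and saves less encoder compute.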