EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

Yuhao Chen; Bin Shan; Xin Ye; Cheng Chen

TL;DR

EvoPrune is an early-stage visual token pruning method for MLLMs that prunes tokens directly during visual encoding. On the VideoMME dataset it achieves a 2$\times$ inference speedup with less than 1% performance degradation, demonstrating its potential for latency-sensitive MLLM deployment.

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance in vision-language tasks, but their inference efficiency is severely limited by the rapid growth in the number of visual tokens in complex scenarios such as high-resolution images and videos. Existing visual token pruning methods mainly operate after visual encoding, overlooking the substantial computational cost incurred during the encoding stage. To address this issue, we propose EvoPrune, an early-stage visual token pruning method for MLLMs that performs pruning directly during visual encoding. Specifically, EvoPrune employs a layer-wise pruning strategy guided by token similarity, diversity, and attention-based importance to retain the most informative visual tokens at selected encoding layers. Extensive experiments on image and video benchmarks validate the effectiveness of EvoPrune. In particular, on the VideoMME dataset, EvoPrune achieves a 2$\times$ inference speedup with less than 1% performance degradation, demonstrating its potential for latency-sensitive MLLM deployment.
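To make the idea of a pruning score that combines similarity, diversity, and attention-based importance more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's formulation: the function names `composite_merge_scores` and `merge_top_pairs`, the weights `w_div` and `w_attn`, the exact form of each term, and the averaging merge rule are all hypothetical.

```python
import numpy as np

def composite_merge_scores(tokens, attn, w_div=0.5, w_attn=0.5):
    """Score every token pair; a higher score means a better merge candidate.

    tokens: (n, d) visual token features at one encoder layer.
    attn:   (n,) per-token attention importance.
    w_div, w_attn: hypothetical weights, not taken from the paper.
    """
    n = tokens.shape[0]
    # Similarity term: cosine similarity between all token pairs.
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    # Diversity term (assumed form): tokens unlike the rest of the set are
    # treated as distinct content, so pairs containing them are penalized.
    mean_sim = (sim.sum(axis=1) - np.diag(sim)) / max(n - 1, 1)
    distinct = 1.0 - mean_sim
    div_pen = distinct[:, None] + distinct[None, :]
    # Attention term (assumed form): pairs containing a highly attended
    # token are penalized so that important tokens tend to survive.
    attn_norm = attn / (attn.max() + 1e-8)
    attn_pen = np.maximum(attn_norm[:, None], attn_norm[None, :])
    score = sim - w_div * div_pen - w_attn * attn_pen
    np.fill_diagonal(score, -np.inf)  # a token is never paired with itself
    return score

def merge_top_pairs(tokens, attn, r, **weights):
    """Greedily merge up to r disjoint top-scoring pairs by averaging."""
    score = composite_merge_scores(tokens, attn, **weights)
    n = tokens.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    order = np.argsort(-score[rows, cols])  # best pairs first
    used = np.zeros(n, dtype=bool)
    merged_tokens, merged_attn = [], []
    for idx in order:
        if len(merged_tokens) == r:
            break
        i, j = rows[idx], cols[idx]
        if used[i] or used[j]:
            continue
        used[i] = used[j] = True
        merged_tokens.append((tokens[i] + tokens[j]) / 2.0)
        merged_attn.append(max(attn[i], attn[j]))
    keep = ~used
    out_tokens = np.vstack([tokens[keep]] + [m[None, :] for m in merged_tokens])
    out_attn = np.concatenate([attn[keep], np.array(merged_attn, dtype=attn.dtype)])
    return out_tokens, out_attn
```

Each call to `merge_top_pairs` with `r` merges reduces the token count by `r`, so applying it at several encoder layers shrinks the sequence before the later layers and the language model see it.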

Paper Structure (44 sections, 13 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Related work
Multimodal Large Language Models
Visual Token Pruning
Method
Overall
Layer-wise Pruning Budget Allocation
Merging Strategy Design
Score-Guided Token Merging
Similarity, Diversity, and Attention Integration
Similarity Attraction
Diversity Penalty
Attention Preservation
Experiment
Experimental Settings
...and 29 more sections

Figures (4)

Figure 1: Inference time of different components in MLLMs with and without visual pruning under different input scales.
Figure 2: Overview of the EvoPrune framework. Token merging operations are applied at selected visual encoder layers. A composite score matrix integrating semantic similarity, diversity, and attention-based importance guides token pair selection.
Figure 3: Breakdown of Time-To-First-Token (TTFT, 1 unit = 10 ms). The Visual Encoder, Other, and LLM Backbone correspond to the visual encoder, intermediate processing modules (e.g., pooling), and the language model backbone, respectively.
Figure 4: Performance comparison of different layer-wise merging strategies across video benchmarks. We present the Accuracy (top row) and Encoding Time (bottom row) as functions of the retained token budget $B \in \{16, 32, 64\}$. Each curve represents a specific layer-wise allocation pattern. Specifically, we set the window size $N=13$ for First and Last (out of 26 total layers), and the growth/decay rate $\alpha=1$ for Increasing and Decreasing.
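The layer-wise allocation patterns named in the Figure 4 caption (First, Last, Increasing, Decreasing) can be sketched as a weighting over encoder layers. The code below is one plausible reading, not the paper's definition: the function name `allocate_merges`, the parameter `total` (a total number of merges to distribute), the power-law weights for Increasing and Decreasing, and the rounding rule are all assumptions. Only the pattern names, the 26 layers, $N=13$, and $\alpha=1$ come from the caption.

```python
def allocate_merges(total, num_layers=26, pattern="first", window=13, alpha=1.0):
    """Split `total` token merges across encoder layers.

    Assumed semantics: "first"/"last" spread merges evenly over the first or
    last `window` layers; "increasing"/"decreasing" weight layer l by a power
    law with exponent `alpha`.
    """
    layers = range(num_layers)
    if pattern == "first":
        weights = [1.0 if l < window else 0.0 for l in layers]
    elif pattern == "last":
        weights = [1.0 if l >= num_layers - window else 0.0 for l in layers]
    elif pattern == "increasing":
        weights = [float(l + 1) ** alpha for l in layers]
    elif pattern == "decreasing":
        weights = [float(num_layers - l) ** alpha for l in layers]
    else:
        raise ValueError(f"unknown pattern: {pattern}")

    weight_sum = sum(weights)
    raw = [total * w / weight_sum for w in weights]
    counts = [int(x) for x in raw]  # round down first
    # Hand the leftover merges to the layers with the largest fractional parts.
    remainder = total - sum(counts)
    by_fraction = sorted(layers, key=lambda l: raw[l] - counts[l], reverse=True)
    for l in by_fraction[:remainder]:
        counts[l] += 1
    return counts
```

For example, `allocate_merges(130, pattern="first")` assigns 10 merges to each of the first 13 layers and none to the remaining 13, which concentrates token reduction early in the encoder.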
