OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport

Xiwen Chen, Wenhui Zhu, Gen Li, Xuanzhao Dong, Yujian Xiong, Hao Wang, Peijie Qiu, Qingquan Song, Zhipeng Wang, Shao Tang, Yalin Wang, Abolfazl Razi

TL;DR

This work proposes OTPrune, a training-free framework that formulates visual token pruning as distribution alignment via optimal transport (OT). It derives a tractable submodular objective that enables efficient optimization and proves that this objective is monotone and submodular.

Abstract

Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, but existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and we theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further analyze how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments across a wide range of benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at https://github.com/xiwenc1/OTPrune.
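
To make the alignment objective concrete, the following minimal sketch computes the exact squared 2-Wasserstein distance between two uniform empirical distributions in one dimension, using the quantile-function form of the distance. It is a toy illustration only: real visual tokens are high-dimensional, and the tractable objective derived in the paper is not reproduced here. The function name `w2_squared_1d` and the scalar "tokens" are assumptions made for this example.

```python
def w2_squared_1d(xs, ys):
    """Squared 2-Wasserstein distance between the uniform empirical
    distributions on xs and ys (1-D only), via matched quantiles."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    pos = 0          # current quantile level, in integer units of 1/(n*m)
    total = 0.0
    while i < n and j < m:
        # Next breakpoint of either quantile function.
        nxt = min((i + 1) * m, (j + 1) * n)
        total += (nxt - pos) * (xs[i] - ys[j]) ** 2
        pos = nxt
        if (i + 1) * m == pos:
            i += 1
        if (j + 1) * n == pos:
            j += 1
    return total / (n * m)


# Hypothetical scalar "tokens" (toy values, not from the paper).
full_tokens = [0.0, 1.0, 2.0, 3.0]
first_k = [0.0, 1.0]       # keep the first two tokens
spread = [0.0, 2.0]        # keep two tokens spread over the range

d_first = w2_squared_1d(full_tokens, first_k)   # 1.5
d_spread = w2_squared_1d(full_tokens, spread)   # 0.5
```

In this toy case the spread-out subset is closer to the full set than the first-K subset, which is the kind of representativeness a distribution-alignment criterion rewards.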

Paper Structure (20 sections, 1 theorem, 46 equations, 5 figures, 2 tables, 1 algorithm)

Key Result

Lemma 1

(kulesza2012determinantal; krause2007near; bach2013learning) Let $\boldsymbol{A} \in \mathbb{R}^{m \times m}$ be a positive definite matrix, and for any subset $S \subseteq \{1,\dots,m\}$ let $\boldsymbol{A}_S$ denote the corresponding principal submatrix. Define the set function $h(S) = \log\det(\boldsymbol{A}_S)$. Then $h$ is a submodular set function.
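
The lemma can be checked numerically on a small example. The sketch below assumes the set function is the standard log-determinant $h(S) = \log\det(\boldsymbol{A}_S)$ from the cited works, with $h(\emptyset) = 0$, and brute-forces the diminishing-returns inequality $h(S \cup \{i\}) - h(S) \ge h(T \cup \{i\}) - h(T)$ for all $S \subseteq T$ and $i \notin T$. The matrix `A` and the helper names are made up for illustration; this is a sanity check, not a proof.

```python
import itertools
import math


def logdet(A, S):
    """log det of the principal submatrix A_S via Cholesky.
    Returns 0.0 for the empty set (determinant of the empty matrix is 1)."""
    idx = sorted(S)
    k = len(idx)
    L = [[0.0] * k for _ in range(k)]
    total = 0.0
    for r in range(k):
        for c in range(r + 1):
            s = A[idx[r]][idx[c]] - sum(L[r][t] * L[c][t] for t in range(c))
            if r == c:
                L[r][c] = math.sqrt(s)      # s > 0 because A is positive definite
                total += 2.0 * math.log(L[r][c])
            else:
                L[r][c] = s / L[c][c]
    return total


def is_submodular(A, tol=1e-9):
    """Brute-force check of diminishing returns over all S ⊆ T and i ∉ T."""
    m = len(A)
    for size_t in range(m + 1):
        for T in itertools.combinations(range(m), size_t):
            T = set(T)
            for size_s in range(len(T) + 1):
                for S in itertools.combinations(sorted(T), size_s):
                    S = set(S)
                    for i in range(m):
                        if i in T:
                            continue
                        gain_s = logdet(A, S | {i}) - logdet(A, S)
                        gain_t = logdet(A, T | {i}) - logdet(A, T)
                        if gain_s < gain_t - tol:
                            return False
    return True


# Hypothetical positive definite matrix: A = B B^T + I.
B = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
A = [[sum(a * b for a, b in zip(B[i], B[j])) + (1.0 if i == j else 0.0)
      for j in range(4)] for i in range(4)]

check = is_submodular(A)
```

The enumeration is exponential in $m$, so it is only practical for very small matrices.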

Figures (5)

  • Figure 1: Overview of the proposed OTPrune framework. Given visual and textual inputs, a vision encoder with a projector produces image tokens that are fed into a large language model together with text tokens and reasoning prompts. Existing methods, such as DivPrune, select tokens by maximizing diversity, which may overlook global representativeness. In contrast, OTPrune formulates pruning as distribution alignment by minimizing the 2-Wasserstein (optimal transport) distance between the full and pruned token distributions, thereby preserving both local diversity and global structure for efficient and semantically faithful multimodal reasoning.
  • Figure 2: Correlation between OT distance and downstream performance. We evaluate several manually designed selection strategies (First-K, Last-K, Uniform, and Random) along with the SOTA method DivPrune (divprune). For each strategy, we compute the OT distance between the selected subset and the full token set and measure downstream performance across 11 multimodal benchmarks. Both OT distance (lower distance $\Rightarrow$ lower rank) and task performance (higher score $\Rightarrow$ lower rank) are ranked to compute Spearman’s correlation. A strong positive correlation confirms that smaller OT distance corresponds to higher performance.
  • Figure 3: Comparison with diversity-based methods using LLaVA 1.5-13B. We evaluate OTPrune, DivPrune (divprune), DPP (kulesza2012determinantal), and Random sampling across 11 multimodal benchmarks under different pruning ratios $\{0.05,\,0.098,\,0.15,\,0.2\}$. We report both absolute performance and relative performance, defined as the ratio between pruned and original model performance (pruned/original). We also compute the OT distance between each subset and the full token set and visualize the relative distance normalized by OTPrune, i.e., $f(\mathcal{C}_{\text{method}})/f(\mathcal{C}_{\text{OTPrune}})$.
  • Figure 4: Relative OT distance comparison across pruning methods. We evaluate OTPrune, DivPrune (divprune), DPP (kulesza2012determinantal), and Random sampling using LLaVA 1.5-13B across 11 multimodal benchmarks under different token ratios $\{0.05,\,0.098,\,0.15,\,0.2\}$. Each bar shows the relative OT distance normalized by OTPrune, i.e., $f(\mathcal{C}_{\text{method}})/f(\mathcal{C}_{\text{OTPrune}})$, where lower values indicate closer alignment with the original token distribution. OTPrune consistently achieves the smallest OT distance across all ratios, demonstrating superior distributional fidelity. The last panel reports the overall average across all 11 datasets.
  • Figure 5: Sensitivity analysis of the balancing coefficient $\gamma$.
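
The ranking convention in the Figure 2 caption (lower distance gets a lower rank, higher score gets a lower rank) can be sketched as follows. This is a plain-Python illustration of Spearman's correlation under that convention, not the authors' evaluation script; the function names and the numbers are toy placeholders, not values from the paper.

```python
import math


def ranks(values, descending=False):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    r = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            r[order[k]] = avg
        pos = end + 1
    return r


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_distance_vs_score(ot_distances, scores):
    """Spearman correlation with the caption's ranking directions."""
    d_rank = ranks(ot_distances)                # lower distance => lower rank
    s_rank = ranks(scores, descending=True)     # higher score   => lower rank
    return pearson(d_rank, s_rank)


# Toy placeholders for five selection strategies (NOT results from the paper).
toy_distance = [0.9, 0.7, 0.5, 0.3, 0.1]
toy_score = [40.0, 45.0, 50.0, 55.0, 60.0]
rho = spearman_distance_vs_score(toy_distance, toy_score)
```

Because both quantities are ranked so that "better" means a lower rank, a positive coefficient indicates that strategies with smaller OT distance also tend to score higher.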

Theorems & Definitions (4)

  • Proof
  • Definition 1: Submodularity
  • Lemma 1
  • Proof