PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou

TL;DR

Visual Language Models (VLMs) incur high inference costs because images are represented by a large number of visual tokens. PACT reduces this count at an early layer $L$ of the language model in three steps: Efficient Unimportant Tokens Identification (EUTI) prunes unimportant tokens, Distance Bounded Density Peak Clustering (DBDPC) clusters the remaining tokens so that redundant ones can be merged, and a recovery step reintegrates pruned tokens that lie close to a cluster center, all without additional training. Across diverse models and datasets, PACT achieves up to $71.3\%$ visual token reduction with only $1.4\%$ accuracy loss and significant speedups, outperforming existing token-reduction methods while remaining compatible with FlashAttention. The approach is architecture-agnostic, which makes it broadly applicable to VLMs, including multi-turn visual dialogue, and it addresses both token irrelevance and token redundancy.

Abstract

Visual Language Models require substantial computational resources for inference due to the additional input tokens needed to represent visual information. However, these visual tokens often contain redundant and unimportant information, resulting in an unnecessarily high number of tokens. To address this, we introduce PACT, a method that reduces inference time and memory usage by pruning irrelevant tokens and merging visually redundant ones at an early layer of the language model. Our approach uses a novel importance metric to identify unimportant tokens without relying on attention scores, making it compatible with FlashAttention. We also propose a novel clustering algorithm, called Distance Bounded Density Peak Clustering, which efficiently clusters visual tokens while constraining the distances between elements within a cluster by a predefined threshold. We demonstrate the effectiveness of PACT through extensive experiments.
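To make the idea of a distance-bounded clustering concrete, the following is a minimal sketch rather than the paper's DBDPC algorithm. It assumes cosine distance, a neighbour-count density, and a greedy center selection; the function names `cosine_distance_matrix` and `distance_bounded_clustering` and the parameter `cutoff` are hypothetical. It only illustrates how a hard bound on the token-to-center distance can be enforced.

```python
import numpy as np


def cosine_distance_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine distances (1 - cosine similarity) between rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def distance_bounded_clustering(tokens: np.ndarray, cutoff: float):
    """Greedy density-ordered clustering with a hard distance bound.

    Returns (centers, labels): centers holds the row indices of the chosen
    cluster centers, and labels[i] is the position in centers of the center
    that token i is assigned to.
    """
    dist = cosine_distance_matrix(tokens)
    # Local density: number of tokens within the cutoff (an assumed definition).
    density = (dist <= cutoff).sum(axis=1)
    # Visit tokens from densest to sparsest; a stable sort keeps ties deterministic.
    order = np.argsort(-density, kind="stable")

    centers = []
    for i in order:
        # A token becomes a center only if no existing center is within the cutoff.
        if not centers or dist[i, centers].min() > cutoff:
            centers.append(int(i))

    centers = np.array(centers)
    # Every rejected token had a center within the cutoff when it was visited,
    # so assigning each token to its nearest center respects the bound.
    labels = dist[:, centers].argmin(axis=1)
    return centers, labels
```

In this sketch a larger `cutoff` yields fewer centers and therefore a stronger reduction, at the price of merging tokens that are less alike.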

Paper Structure

This paper contains 28 sections, 19 equations, 17 figures, 15 tables, and 4 algorithms.

Figures (17)

  • Figure 1: Simplified illustration of PACT. This figure illustrates the three-step process of PACT: (1) First, EUTI is used to prune visual tokens deemed unimportant; (2) Then, DBDPC is applied to cluster the remaining tokens, ensuring that the distance between each token and its corresponding cluster center is smaller than the cutoff distance; (3) Finally, initially pruned tokens that are close to cluster centers are reintegrated, and the elements within each cluster are merged to form the reduced set of visual tokens.
  • Figure 3: Illustration of visual token norm statistics at the fourth layer of LLaVA-OneVision-7B.
  • Figure 4: Illustration of the maximum distance between the keys of visual tokens for the first 10 layers of LLaVA-OneVision-7B before the application of rotary embeddings.
  • Figure 5: Comparison between PACT, DBDPC, and EUTI against other visual token reduction methods across various reduction ratios applied on LLaVA-OneVision-7B.
  • Figure 6: Comparison between PACT and other visual token reduction methods across various reduction ratios applied on Qwen2-VL-7B-Instruct.
  • ...and 12 more figures
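
Step (3) of the Figure 1 caption, reintegrating pruned tokens that lie close to a cluster center and then merging each cluster, might look roughly like the sketch below. It is an illustration under assumptions, not the paper's procedure: the function name `recover_and_merge`, the use of cosine distance, and merging by plain averaging are all hypothetical choices, and every cluster id is assumed to appear at least once in `labels`.

```python
import numpy as np


def recover_and_merge(kept, labels, centers, pruned, cutoff):
    """Reintegrate pruned tokens near a center, then average each cluster.

    kept:    (n, d) float array of tokens that survived pruning
    labels:  (n,) integer cluster id in [0, k) for each kept token
    centers: (k, d) float array of cluster-center vectors
    pruned:  (m, d) float array of tokens removed by the pruning step
    cutoff:  maximum cosine distance for a pruned token to be recovered
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Cosine distance from each pruned token to each center.
    dist = 1.0 - unit(pruned) @ unit(centers).T
    nearest = dist.argmin(axis=1)
    close = dist[np.arange(len(pruned)), nearest] <= cutoff

    # Pool the kept tokens with the recovered ones.
    all_tokens = np.concatenate([kept, pruned[close]], axis=0)
    all_labels = np.concatenate([labels, nearest[close]], axis=0)

    # Merge by plain averaging (an assumption; a weighted merge is equally plausible).
    k = len(centers)
    sums = np.zeros((k, kept.shape[1]))
    np.add.at(sums, all_labels, all_tokens)
    counts = np.bincount(all_labels, minlength=k)[:, None]
    return sums / counts
```

The result has one row per cluster, so the number of visual tokens passed to later layers equals the number of centers regardless of how many pruned tokens are recovered.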