Patch Ranking: Efficient CLIP by Learning to Rank Local Patches

Cheng-En Wu, Jinhong Lin, Yu Hen Hu, Pedro Morgado

TL;DR

This paper tackles the high computational cost of CLIP’s ViT backbone by introducing a Golden Ranking for patch-token importance, alongside a lightweight predictor to approximate this ranking during inference. A three-phase pipeline—Golden Ranking establishment, predictor-based pruning, and learnable-token-based model tuning—yields up to 40% token reduction with only ~0.3 percentage-point accuracy loss across seven datasets. Learnable text and visual tokens further mitigate pruning-induced performance degradation, enabling robust zero-shot and few-shot operation with reduced compute. Empirically, the approach outperforms CLS-attention pruning and other token-pruning baselines, achieving significant GFLOPs and latency savings while maintaining competitive accuracy in multimodal recognition tasks.

Abstract

Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP and identifying the optimal subset of tokens for maximum performance. To address this, we propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this Ranking. To compensate for any performance degradation resulting from token pruning, we incorporate learnable visual tokens that aid in restoring and potentially enhancing the model's performance. Our work presents a comprehensive and systematic investigation of pruning tokens within the ViT backbone of CLIP models. Through our framework, we successfully reduced 40% of patch tokens in CLIP's ViT while only suffering a minimal average accuracy loss of 0.3 across seven datasets. Our study lays the groundwork for building more computationally efficient multimodal models without sacrificing their performance, addressing a key challenge in the application of advanced vision-language models.
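The abstract does not spell out the greedy search, so the following is only a minimal sketch of one way such a ranking could be built: backward elimination, where the patch whose removal hurts a scoring function the least is dropped at each step. The names `greedy_ranking` and `score_fn`, and the toy mean-pooled "encoder" standing in for CLIP's ViT, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def greedy_ranking(tokens, score_fn):
    """Return patch indices ordered from most to least important.

    Hypothetical backward-elimination search: at each step, drop the patch
    whose removal leaves score_fn(remaining_tokens) highest.
    """
    kept = list(range(len(tokens)))
    removed = []
    while len(kept) > 1:
        best_idx, best_score = None, -np.inf
        for i in kept:
            candidate = [j for j in kept if j != i]
            s = score_fn(tokens[candidate])
            if s > best_score:
                best_idx, best_score = i, s
        kept.remove(best_idx)
        removed.append(best_idx)
    removed.append(kept[0])   # the last survivor is the most important patch
    return removed[::-1]      # most important first

# Toy data (assumption): patch 2 dominates the pooled feature.
tokens = np.array([
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1],
    [10.0, 0.0, 0.0],
    [0.0, -0.1, 0.05],
])
full_feature = tokens.mean(axis=0)

# Stand-in score: how well the pooled feature of the kept patches
# matches the pooled feature of all patches.
score_fn = lambda kept_tokens: cosine(kept_tokens.mean(axis=0), full_feature)

ranking = greedy_ranking(tokens, score_fn)
print(ranking[0])  # the dominant patch (index 2) comes first
```

This sketch calls the scoring function O(N^2) times for N patches. Such a cost is one plausible reason to train a lightweight predictor that approximates the ranking at inference time instead of searching for it.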

Paper Structure (30 sections, 3 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Accuracy vs. complexity of various token pruning strategies for pre-trained CLIP models, evaluated on Caltech101. Six points represent models with token keep rates from 100% to 50%. The CLS Attention method prunes image patches by measuring similarity between the CLS token and the other tokens in the 4th layer of CLIP's ViT. Patch Ranking, using our Preservation-based ranking strategy, outperforms the traditional CLS method. Patch Ranking w/ T Prompt Tuning and Patch Ranking w/ V+T Prompt Tuning extend this by adding learnable tokens to the Text Encoder or to both the ViT and the Text Encoder, fine-tuned with 16 shots per class. Prompt tuning boosts performance, and tuning both prompts (green line) shows no significant degradation down to a 50% keep rate.
  • Figure 2: Overview of our pruning framework for patch tokens in CLIP's ViT. The framework comprises three main phases: (a) Phase I: Establishing a Golden Ranking, which assigns a score to each token based on its importance, as discussed in Section \ref{sec:golden_ranking}; (b) Phase II: Predicting the Golden Ranking, which trains a predictor to approximate the Golden Ranking, as elaborated in Section \ref{sec:predictor}; and (c) Phase III: Model Tuning through Learnable Tokens, in which additional learnable visual tokens are added to mitigate the accuracy loss caused by removing patch tokens, as detailed in Section \ref{sec:prompt_tuning}.
  • Figure 3: Visualization of the scoring functions for patch token pruning. Top row: Label-Driven Ranking Score; middle row: Maximum Confidence Score; bottom row: Feature Preservation Score.
  • Figure 4: This figure compares token pruning methods at the 50% keep rate: the middle column shows CLS attention weight-based pruning, and the right column features our Feature Preservation Score method.
  • Figure 5: This figure compares the classification accuracy between the CLS attention method and our Patch Ranking approach, both without fine-tuning post-token pruning. CLS attention employs CLS attention weights to rank tokens, whereas Patch Ranking utilizes the Feature Preservation Score for this purpose. Token removal occurs at the first layer of CLIP's ViT. We present classification accuracy across different keep rates, ranging from 100% to 50%, highlighting the differential impact of each method on model performance as the number of pruned tokens increases.
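The captions above compare ranking patches by CLS attention with ranking them by another score, then keeping a fixed fraction of them. A minimal sketch of that mechanic follows, assuming the CLS token sits at index 0, attention is averaged over heads, and the number of kept patches is the rounded product of keep rate and patch count. The function names and these conventions are illustrative assumptions, not the authors' implementation; any per-patch score (for example a Feature Preservation Score) could be passed in place of the CLS attention scores.

```python
import numpy as np

def cls_attention_scores(attn):
    """Score each patch by the attention the CLS query pays to it.

    attn: array of shape (heads, 1 + N, 1 + N); index 0 is assumed to be CLS.
    Returns an array of shape (N,), averaged over heads.
    """
    return attn[:, 0, 1:].mean(axis=0)

def prune_by_keep_rate(tokens, scores, keep_rate):
    """Keep the CLS token plus the top-scoring fraction of patch tokens.

    tokens: array of shape (1 + N, D), CLS first.
    scores: array of shape (N,), higher means more important.
    Returns (pruned_tokens, kept_patch_indices), with the original patch
    order preserved among the kept patches.
    """
    n = scores.shape[0]
    k = max(1, int(round(keep_rate * n)))
    top = np.argsort(-scores, kind="stable")[:k]
    keep = np.sort(top)
    pruned = np.concatenate([tokens[:1], tokens[1 + keep]], axis=0)
    return pruned, keep

# Toy usage: 4 patches with 2-dimensional features, keep rate 50%.
tokens = np.arange(10.0).reshape(5, 2)      # row 0 plays the role of CLS
scores = np.array([0.1, 0.5, 0.2, 0.9])
pruned, keep = prune_by_keep_rate(tokens, scores, keep_rate=0.5)
print(keep)          # [1 3]
print(pruned.shape)  # (3, 2)
```

Under the same assumptions, an input with 196 patches and a 60% keep rate would retain 118 patches plus the CLS token.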