MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric

Haokun Lin, Haoli Bai, Zhili Liu, Lu Hou, Muyi Sun, Linqi Song, Ying Wei, Zhenan Sun

TL;DR

This paper proposes the Module-wise Pruning Error (MoPE) metric, which assesses CLIP module importance by the performance decline on cross-modal tasks, and introduces a unified pruning framework applicable to both the pre-training and task-specific fine-tuning compression stages.

Abstract

Vision-language pre-trained models have achieved impressive performance on various downstream tasks. However, their large model sizes hinder their utilization on platforms with limited computational resources. We find that directly using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance. Recent efforts for VLP compression either adopt uni-modal compression metrics resulting in limited performance or involve costly mask-search processes with learnable masks. In this paper, we first propose the Module-wise Pruning Error (MoPE) metric, accurately assessing CLIP module importance by performance decline on cross-modal tasks. Using the MoPE metric, we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages. For pre-training, MoPE-CLIP effectively leverages knowledge from the teacher model, significantly reducing pre-training costs while maintaining strong zero-shot capabilities. For fine-tuning, consecutive pruning from width to depth yields highly competitive task-specific models. Extensive experiments in two stages demonstrate the effectiveness of the MoPE metric, and MoPE-CLIP outperforms previous state-of-the-art VLP compression methods.
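
The abstract describes MoPE as the performance decline on a cross-modal task when a module is removed. The following is a minimal sketch of that idea only, not the authors' implementation: the `evaluate` callback, the module names, and the toy scores are all hypothetical stand-ins, and the rule of pruning the lowest-MoPE modules first is an assumption that follows from reading MoPE as an importance score.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List


def mope_scores(
    modules: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Return, per module, score(full model) - score(model without that module).

    `evaluate` is a hypothetical callback: it takes the set of removed module
    names and returns a cross-modal task score (e.g. a retrieval recall).
    """
    full_score = evaluate(frozenset())
    return {m: full_score - evaluate(frozenset([m])) for m in modules}


def select_modules_to_prune(scores: Dict[str, float], num_to_prune: int) -> List[str]:
    """Pick the modules whose removal hurts the least (assumed pruning rule)."""
    return sorted(scores, key=scores.get)[:num_to_prune]


# Toy stand-in for a real evaluation: each module contributes a fixed amount
# to the score. These numbers are invented purely for illustration.
_TOY_CONTRIBUTION = {"head_0": 5.0, "head_1": 0.5, "layer_3": 12.0}


def toy_evaluate(removed: FrozenSet[str]) -> float:
    return 70.0 - sum(_TOY_CONTRIBUTION[m] for m in removed)


scores = mope_scores(_TOY_CONTRIBUTION.keys(), toy_evaluate)
to_prune = select_modules_to_prune(scores, num_to_prune=1)
```

In a real setting `evaluate` would run the CLIP model with the chosen module masked out on a held-out cross-modal benchmark, which is far more expensive than this toy function suggests.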

Paper Structure (57 sections, 7 equations, 10 figures, 17 tables, 2 algorithms)

Figures (10)

  • Figure 1: Empirical comparison between (a) the original large CLIP model and three smaller models with compressed vision encoders, including (b) a pre-trained small CLIP model; (c) a small model obtained by substituting the original vision encoder in (a) with the small vision encoder of (b); and (d) a small model with the vision encoder pruned from (a). We perform pruning during pre-training or fine-tuning and evaluate (1) after fine-tuning or (2) in the zero-shot setting. Note that we train the substituted encoder $E^s_v$ in (c) and $E_v^p$ in (d) with image-text contrastive loss $\mathcal{L}_{itc}$. TR and IR stand for image-to-text and text-to-image retrieval, respectively.
  • Figure 2: The overall workflow of training MoPE-CLIP. (a) During the fine-tuning stage, we apply width-first-then-depth pruning on the fine-tuned CLIP vision or text encoder to obtain powerful task-specific models. (b) An illustration of our distillation process, transferring cross-modal and uni-modal knowledge. (c) During the pre-training stage, we apply consecutive pruning in the width and depth directions on zero-shot CLIP encoders. (d) An illustration of the MoPE metric, measuring the performance drop of CLIP after removing the module $\theta$.
  • Figure 3: Image-to-text retrieval results of three small model architectures on MSCOCO dataset. All models are trained with distillation.
  • Figure 4: Text-to-image retrieval results of three small model architectures on MSCOCO dataset. All models are trained with distillation.
  • Figure 5: Comparison of training efficiency. EfficientVLM and TinyCLIP are trained for 25 epochs, while MoPE-CLIP$_{large}$ is trained for 20 epochs. All models are compressed from CLIP-ViT-L/14 at a 2x compression ratio and trained on the CC3M dataset.
  • ...and 5 more figures
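
The Figure 1 caption mentions training encoders with an image-text contrastive loss $\mathcal{L}_{itc}$ but does not spell it out here. As a rough sketch, the block below implements the symmetric, temperature-scaled contrastive loss commonly used for CLIP-style training. The symmetric form, the temperature value of 0.07, and the function name `itc_loss` are assumptions for illustration and may differ from the paper's exact formulation.

```python
import math
from typing import List, Sequence


def _log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def itc_loss(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the source
) -> float:
    """Symmetric contrastive loss; pair k of images and texts is the positive."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    n = len(logits)
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(_log_softmax(logits_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Matched pairs should give a much lower loss than deliberately swapped pairs.
aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The two directions of the loss correspond to the TR (image-to-text) and IR (text-to-image) retrieval tasks named in the caption.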