ELIP: Efficient Discriminative Language-Image Pre-training with Fewer Vision Tokens

Yangyang Guo, Haoyu Zhang, Yongkang Wong, Liqiang Nie, Mohan Kankanhalli

TL;DR

ELIP reduces the computational cost of discriminative language-image pre-training by pruning vision tokens under text supervision. It uses a progressive four-block ViT design with token merging, in which a fusion of the vision and text CLS features decides which tokens to retain. This removes ~30% of the vision tokens at an average downstream accuracy drop of about $0.32$ points. Across multiple backbones (e.g., ALBEF, BLIP, METER) and tasks (retrieval, VQA, VE, NLVR$^2$, captioning), ELIP achieves favorable efficiency-effectiveness trade-offs, and the freed GPU resources permit larger pre-training batch sizes and faster pre-training without extra trainable parameters. The results suggest substantial practical benefits for scalable, resource-efficient multimodal pre-training and offer a foundation for future adaptive pruning and integration with other efficiency techniques.

Abstract

Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into efficient language-image pre-training, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs. Our method is designed with several strengths, such as being computation-efficient, memory-efficient, and trainable-parameter-free, and is distinguished from previous vision-only token pruning approaches by its alignment with task objectives. We implement this method in a progressive pruning manner using several sequential blocks. To evaluate its generalization performance, we apply ELIP to three commonly used language-image pre-training models and utilize public image-caption pairs with 4M images for pre-training. Our experiments demonstrate that with the removal of ~30% of vision tokens across 12 ViT layers, ELIP maintains performance comparable to the baselines (~0.32 accuracy drop on average) over various downstream tasks including cross-modal retrieval, VQA, image captioning, etc. In addition, the GPU resources spared by ELIP allow us to scale up with larger batch sizes, thereby accelerating model pre-training and sometimes even enhancing downstream model performance.
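
To make the idea of text-supervised pruning and merging concrete, the following is a minimal sketch, not the authors' implementation. It assumes the guidance vector is a convex combination of the vision and text CLS features weighted by a coefficient `lam`, that tokens are scored by a softmax over scaled dot products with that vector, and that all pruned tokens are merged into a single score-weighted average. The function name `prune_and_merge`, the `keep_ratio` parameter, and the scoring rule are hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prune_and_merge(tokens, vision_cls, text_cls, lam=0.5, keep_ratio=0.7):
    """Keep the highest-scoring vision tokens and merge the rest into one.

    tokens: list of patch-token vectors (CLS excluded).
    Returns (new_tokens, kept_indices). All scoring details are assumptions.
    """
    # Assumed fusion: convex combination of the two CLS features.
    guide = [lam * v + (1.0 - lam) * t for v, t in zip(vision_cls, text_cls)]

    # Score tokens with a softmax over scaled dot products with the guide.
    scale = math.sqrt(len(guide))
    logits = [dot(tok, guide) / scale for tok in tokens]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Keep the top-scoring fraction, preserving the original token order.
    n_keep = max(1, round(keep_ratio * len(tokens)))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(order[:n_keep])
    dropped = order[n_keep:]
    new_tokens = [tokens[i] for i in kept_indices]

    # Merge the pruned tokens into a single score-weighted average token.
    if dropped:
        weight = sum(scores[i] for i in dropped)
        merged = [
            sum(scores[i] * tokens[i][d] for i in dropped) / weight
            for d in range(len(guide))
        ]
        new_tokens.append(merged)
    return new_tokens, kept_indices


# Toy input: 10 tokens of dimension 4 (synthetic values, not real features).
toy_tokens = [[math.sin(i + d) for d in range(4)] for i in range(10)]
toy_out, toy_kept = prune_and_merge(
    toy_tokens, vision_cls=[1.0, 0.0, 0.0, 0.0], text_cls=[0.0, 1.0, 0.0, 0.0]
)
```

With `keep_ratio=0.7`, the toy call keeps 7 of the 10 tokens and appends one merged token. The sketch adds no trainable parameters, in line with the parameter-free property the abstract describes.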

Paper Structure

This paper contains 35 sections, 9 equations, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: Visualization of attention map discrepancy between vision-only ViT and vision-language BLIP models and pipeline of our proposed method ELIP. (a) When presented with the same image, ViT and BLIP often see different regions, resulting in a large KL divergence of their attention maps. (b) ELIP achieves efficient discriminative language-image pre-training by pruning less important vision tokens supervised by the text.
  • Figure 2: Token similarity and attention maps across different ViT layers of BLIP, as well as the FLOPs proportion of different modules for three typical language-image pre-trained models. (a) The attention distribution over image tokens grows from uniform to concentrated as the layers go deeper. In addition, the token similarity initially decreases but then significantly increases, indicating that more vision tokens become redundant. (b) Notably, the vision encoder (VE) accounts for the majority of the computational cost of language-image models, compared to the text encoder (TE) and modal fusion (MF).
  • Figure 3: Component effect on the text retrieval performance over the Flickr30K dataset. Left: Performance comparison of pruning-only and pruning-then-merging approaches. Right: Performance change with respect to the feature combination coefficient $\lambda$ (see the equation defining $\lambda$ in the paper).
  • Figure 4: Visualization of pruning results at two ViT depths: 2 and 10. Note that our method gradually reduces the number of effective vision tokens. We omit the merged tokens and show only the attention maps of the remaining ones for a clear illustration.
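
The two quantities mentioned in the captions of Figures 1 and 2, the KL divergence between attention maps and the similarity among vision tokens, can be sketched as follows. This is only an illustration under assumptions: attention maps are treated as flat lists of non-negative weights that are normalized to sum to one, and token similarity is taken to be the mean pairwise cosine similarity. The paper's exact definitions may differ, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two attention maps given as non-negative weights."""
    sum_p, sum_q = sum(p), sum(q)
    p = [x / sum_p for x in p]  # normalize to probability distributions
    q = [x / sum_q for x in q]
    # eps guards against log(0) when q has zero-weight entries.
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct pairs of token vectors."""
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    total, pairs = 0.0, 0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            inner = sum(a * b for a, b in zip(tokens[i], tokens[j]))
            total += inner / (norms[i] * norms[j])
            pairs += 1
    return total / pairs


# Synthetic attention maps over four image tokens (illustrative values only).
uniform_map = [0.25, 0.25, 0.25, 0.25]
peaked_map = [0.85, 0.05, 0.05, 0.05]
divergence = kl_divergence(peaked_map, uniform_map)
```

A larger `kl_divergence` value means the two models attend to more different regions, and a `mean_pairwise_cosine` value close to 1 means the tokens are nearly redundant, which is the situation where pruning and merging should cost the least.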

Theorems & Definitions (2)

  • Remark 1
  • Remark 2