ELIP: Efficient Discriminative Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo, Haoyu Zhang, Yongkang Wong, Liqiang Nie, Mohan Kankanhalli
TL;DR
ELIP tackles the computational burden of discriminative language-image pre-training by pruning vision tokens under text supervision. It employs a progressive four-block ViT with token merging, where a fusion of vision and text CLS features guides which tokens to retain, reducing tokens by ~30% with an average downstream accuracy drop of about 0.32 points. Across multiple backbones (e.g., ALBEF, BLIP, METER) and tasks (retrieval, VQA, VE, NLVR$^2$, captioning), ELIP achieves favorable efficiency-effectiveness trade-offs, enabling larger pre-training batch sizes and faster pre-training without extra trainable parameters. The results suggest practical benefits for scalable, resource-efficient multimodal pre-training and offer a foundation for future adaptive pruning and integration with other efficiency techniques.
Abstract
Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper studies \emph{efficient language-image pre-training}, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose ELIP, a vision token pruning and merging method that removes less influential tokens under the supervision of language outputs. Our method is computation-efficient, memory-efficient, and trainable-parameter-free, and it is distinguished from previous vision-only token pruning approaches by its alignment with task objectives. We implement it in a progressive pruning manner using several sequential blocks. To evaluate its generalization performance, we apply ELIP to three commonly used language-image pre-training models and pre-train on public image-caption pairs with 4M images. Our experiments demonstrate that, with $\sim$30\% of vision tokens removed across 12 ViT layers, ELIP maintains performance comparable to the baselines ($\sim$0.32 accuracy drop on average) over various downstream tasks, including cross-modal retrieval, VQA, image captioning, \emph{etc}. In addition, the GPU resources spared by ELIP allow us to scale up to larger batch sizes, thereby accelerating model pre-training and sometimes even enhancing downstream model performance.
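As a rough illustration of text-guided pruning and merging for a single block, the following minimal sketch may help. It rests on several assumptions that are not taken from the paper: the function name `prune_and_merge`, the parameters `keep_ratio` and `alpha`, the dot-product scoring rule, and the softmax-weighted merge are all hypothetical choices made here for clarity, and the actual ELIP formulation may differ.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def prune_and_merge(tokens, vision_cls, text_cls, keep_ratio=0.7, alpha=0.5):
    """Keep the highest-scoring patch tokens and merge the rest into one.

    tokens:     list of patch-token vectors (CLS token excluded)
    vision_cls: vision CLS feature, same dimension as each token
    text_cls:   text CLS feature, same dimension as each token
    Returns (output_tokens, kept_indices).
    """
    # Assumption: the guidance signal is a convex combination of both CLS features.
    guide = [alpha * v + (1 - alpha) * t for v, t in zip(vision_cls, text_cls)]
    # Assumption: a token's influence is its dot product with the guidance vector.
    scores = [dot(tok, guide) for tok in tokens]

    n_keep = max(1, round(keep_ratio * len(tokens)))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:n_keep])  # restore the original token order
    drop = order[n_keep:]
    kept = [tokens[i] for i in keep]
    if not drop:
        return kept, keep

    # Assumption: pruned tokens are merged by a softmax-weighted average of their
    # scores, so their information is compressed into one token, not discarded.
    top = max(scores[i] for i in drop)
    weights = [math.exp(scores[i] - top) for i in drop]
    total = sum(weights)
    dim = len(tokens[0])
    merged = [
        sum(w * tokens[i][d] for w, i in zip(weights, drop)) / total
        for d in range(dim)
    ]
    return kept + [merged], keep
```

With 10 input tokens and `keep_ratio=0.7`, this sketch returns 7 retained tokens plus 1 merged token. No trainable parameters are involved, which is consistent with the trainable-parameter-free property stated above; applying such a step in several sequential blocks would give a progressive schedule.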
