Enhancing Vision-Language Model Pre-training with Image-text Pair Pruning Based on Word Frequency

Mingliang Liang; Martha Larson

TL;DR

This paper tackles the data-efficiency challenge in pre-training vision-language models (VLMs) by proposing Word-Frequency-based Image-Text Pair Pruning (WFPP), a metadata-free method that prunes pairs based on the word frequencies of their texts in order to balance the corpus. WFPP assigns a discard probability to each word and a score to each text, then removes pairs whose texts contain frequent words. This allows substantial data reduction while keeping zero-shot and retrieval performance close to, or better than, that of CLIP trained on the full data. The authors report a 1.3x pre-training speedup with comparable or better zero-shot results across multiple tasks, and WFPP outperforms the metadata-based MetaCLIP approach, which highlights the importance of balanced word distributions in large-scale VLM pre-training. The study also analyzes how WFPP reshapes the word-frequency distribution while preserving vocabulary richness, suggesting practical benefits for data-efficient training and motivating future work on larger datasets and alternative pruning signals.

Abstract

We propose Word-Frequency-based Image-Text Pair Pruning (WFPP), a novel data pruning method that improves the efficiency of VLMs. Unlike MetaCLIP, our method does not need metadata for pruning, but selects text-image pairs to prune based on the content of the text. Specifically, WFPP prunes text-image pairs whose texts contain words that are frequent across the entire training dataset. The effect of WFPP is to reduce the dominance of frequent words. The result is a better-balanced word-frequency distribution in the dataset, which is known to improve the training of word embedding models. After pre-training on the pruned subset, we fine-tune the model on the entire dataset for one additional epoch to achieve better performance. Our experiments demonstrate that applying WFPP when training a CLIP model improves performance on a wide range of downstream tasks. WFPP also speeds up pre-training because it uses fewer samples. Additionally, we analyze the training data before and after pruning to visualize how WFPP changes the balance of word frequencies. We hope our work encourages researchers to consider the distribution of words in the training data when pre-training VLMs, including models other than CLIP.
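
To make the pruning idea concrete, the following is a minimal Python sketch of frequency-based pair pruning, not the authors' implementation. The abstract and TL;DR only state that each word gets a discard probability and each text gets a score; the specific choices here are assumptions made for illustration: the per-word formula is borrowed from word2vec-style subsampling, the per-text score is the mean of its word probabilities, and pairs are ranked deterministically rather than sampled. The names `word_discard_probs`, `caption_score`, `prune_pairs`, and the parameters `t` and `keep_ratio` are hypothetical.

```python
import math
from collections import Counter


def word_discard_probs(captions, t=1e-3):
    """Map each word to a discard probability that grows with its frequency.

    Assumption: word2vec-style subsampling, p(w) = max(0, 1 - sqrt(t / f(w))),
    where f(w) is the word's relative frequency in the whole corpus.
    """
    counts = Counter(w for c in captions for w in c.lower().split())
    total = sum(counts.values())
    return {
        w: max(0.0, 1.0 - math.sqrt(t / (n / total)))
        for w, n in counts.items()
    }


def caption_score(caption, probs):
    """Score a caption as the mean discard probability of its words (assumption)."""
    words = caption.lower().split()
    if not words:
        return 1.0  # an empty caption carries no text signal; rank it as most prunable
    return sum(probs[w] for w in words) / len(words)


def prune_pairs(pairs, keep_ratio=0.5, t=1e-3):
    """Keep the fraction of (image, caption) pairs with the lowest scores.

    Simplification: deterministic ranking instead of probabilistic sampling.
    The original order of the surviving pairs is preserved.
    """
    captions = [text for _, text in pairs]
    probs = word_discard_probs(captions, t)
    scores = [caption_score(c, probs) for c in captions]
    n_keep = max(1, round(keep_ratio * len(pairs)))
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i])
    kept = sorted(ranked[:n_keep])
    return [pairs[i] for i in kept]


if __name__ == "__main__":
    toy_pairs = [
        ("a.jpg", "a photo of a dog"),
        ("b.jpg", "a photo of a cat"),
        ("c.jpg", "a photo of a dog"),
        ("d.jpg", "rare axolotl swimming"),
    ]
    for image, text in prune_pairs(toy_pairs, keep_ratio=0.5, t=0.01):
        print(image, text)
```

On a toy corpus like the one above, captions built from rare words score lowest and survive pruning, while near-duplicate captions dominated by frequent words are dropped first. A real pipeline would use a proper tokenizer instead of whitespace splitting and would tune `t` and `keep_ratio` to the dataset.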
