Enhancing Vision-Language Model Pre-training with Image-text Pair Pruning Based on Word Frequency

Mingliang Liang, Martha Larson

TL;DR

This paper tackles the data-efficiency challenge in pre-training vision-language models (VLMs) by proposing Word-Frequency-based Image-Text Pair Pruning (WFPP), a metadata-free method that prunes pairs based on the word frequencies of their texts in order to balance the corpus. WFPP assigns a discard probability to each word and a score to each text, then removes pairs whose texts contain frequent words. This enables substantial data reduction while keeping zero-shot and retrieval performance close to, or better than, that of CLIP trained on the full data. The paper reports a 1.3x pre-training speedup with comparable or better zero-shot results across multiple tasks, outperforms the metadata-based MetaCLIP approach, and highlights the importance of balanced word distributions in large-scale VLM pre-training. It also analyzes how WFPP reshapes the word-frequency distribution while preserving vocabulary richness, which suggests practical benefits for data-efficient training and motivates future work on larger datasets and alternative pruning signals.

Abstract

We propose Word-Frequency-based Image-Text Pair Pruning (WFPP), a novel data pruning method that improves the efficiency of VLMs. Unlike MetaCLIP, our method does not need metadata for pruning, but selects text-image pairs to prune based on the content of the text. Specifically, WFPP prunes text-image pairs containing words that are high-frequency across the entire training dataset. The effect of WFPP is to reduce the dominance of frequent words. The result is a better-balanced word-frequency distribution in the dataset, which is known to improve the training of word embedding models. After pre-training on the pruned subset, we fine-tune the model on the entire dataset for one additional epoch to achieve better performance. Our experiments demonstrate that applying WFPP when training a CLIP model improves performance on a wide range of downstream tasks. WFPP also provides the advantage of speeding up pre-training by using fewer samples. Additionally, we analyze the training data before and after pruning to visualize how WFPP changes the balance of word frequencies. We hope our work encourages researchers to consider the distribution of words in the training data when pre-training VLMs, not limited to CLIP.
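
The pruning idea described above can be illustrated with a minimal sketch. The paper's actual equations are not reproduced in this section, so everything specific here is an assumption: the per-word discard probability borrows the subsampling formula commonly used for word embedding models, $P(w) = \max(0,\, 1 - \sqrt{t / f(w)})$, the per-text score is the mean discard probability of the words in the text, and pairs are pruned deterministically by rank. The function names, the threshold `t`, and the whitespace tokenizer are hypothetical.

```python
import math
from collections import Counter


def word_discard_probs(texts, t=1e-3):
    """Per-word discard probability (assumed form): max(0, 1 - sqrt(t / f(w))),
    where f(w) is the relative frequency of word w over all texts."""
    counts = Counter(w for text in texts for w in text.lower().split())
    total = sum(counts.values())
    return {
        w: max(0.0, 1.0 - math.sqrt(t / (c / total)))
        for w, c in counts.items()
    }


def text_score(text, probs):
    """Per-text score (assumed aggregation): mean discard probability of its words."""
    words = text.lower().split()
    if not words:
        return 1.0  # treat an empty caption as the most prunable
    return sum(probs.get(w, 0.0) for w in words) / len(words)


def wfpp_prune(pairs, keep_ratio, t=1e-3):
    """Keep the keep_ratio fraction of (image, text) pairs with the lowest scores,
    i.e. the pairs whose texts are least dominated by frequent words."""
    probs = word_discard_probs([text for _, text in pairs], t)
    ranked = sorted(pairs, key=lambda pair: text_score(pair[1], probs))
    n_keep = max(1, round(keep_ratio * len(pairs)))
    return ranked[:n_keep]


# Toy usage: captions made mostly of frequent words are pruned first.
pairs = [
    ("a.jpg", "photo of a dog"),
    ("b.jpg", "photo of a cat"),
    ("c.jpg", "photo of a photo"),
    ("d.jpg", "rare zebra grazing"),
]
kept = wfpp_prune(pairs, keep_ratio=0.5)
```

In this toy corpus the caption built entirely from the most frequent words ("photo of a photo") receives the highest score and is dropped, while the caption with only rare words is kept. A sampling-based variant, where each pair is dropped with probability equal to its score, would be an equally plausible reading of the description.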

Paper Structure

This paper contains 19 sections, 4 equations, 7 figures, 22 tables.

Figures (7)

  • Figure 1: Zero-shot accuracy on ImageNet-1K classification. CLIP is trained on the CC12M dataset (Changpinyo et al., 2021). Using our Word-Frequency-based Image-Text Pair Pruning (WFPP), we achieve comparable performance while using only approximately 77% of the image-text pairs ($1.3\times$ speedup). The image encoder is ViT-B-16 (Dosovitskiy et al., 2020). "ft" denotes fine-tuning and "w/o ft" denotes without fine-tuning. "Samples seen" refers to the number of samples processed during pre-training.
  • Figure 2: Word distribution. The top-50 words in CC12M (Changpinyo et al., 2021) are shown before pruning (black) and after pruning 50% of image-text pairs using the Random (orange) and WFPP (green) methods. We then calculate the word percentages for Random and WFPP from the counts before and after pruning. Words are ordered by frequency before pruning. The left Y-axis is the number of words and the right Y-axis is the percentage of words, which is the number of words before data pruning divided by the number of words after data pruning.
  • Figure 3: Zero-shot accuracy on ImageNet-1K classification. The CC3M dataset (Sharma et al., 2018) was pruned with WFPP. CLIP trained on the WFPP-pruned data achieves performance comparable to CLIP trained on unpruned data, while using only approximately 70% of the original training data (indicated by the 1.4x speed-up mark). On the left, we show examples of the performance obtained with data pruned by the MetaCLIP method, which remains below the performance obtained with WFPP-pruned data. The image encoder is ViT-B-16 (Dosovitskiy et al., 2020). "ft" stands for fine-tuning. "Samples seen" refers to the number of samples processed during pre-training.
  • Figure 4: Word distribution. The top-100 words in CC12M (Changpinyo et al., 2021) are shown before pruning (black) and after pruning 50% of image-text pairs using the Random (orange) and WFPP (green) methods. We then calculate the word percentages for Random and WFPP from the counts before and after pruning. Words are ordered by frequency before pruning. The left Y-axis is the number of words and the right Y-axis is the percentage of words, which is the number of words before data pruning divided by the number of words after data pruning.
  • Figure 5: Word distribution. The top-100 words in CC12M (Changpinyo et al., 2021) are presented for two parts of the data: the first and the second 50% of the image-text pairs, as split by the WFPP method. We then calculate the word percentages for WFPP from the counts before and after pruning. Words are ordered by frequency before pruning. The left Y-axis is the number of words and the right Y-axis is the percentage of words, which is the number of words before data pruning divided by the number of words after data pruning.
  • ...and 2 more figures
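
As a rough check on the speedups quoted in the captions of Figures 1 and 3, assume (this is an assumption, not a statement from the paper) that pre-training cost scales linearly with the number of image-text pairs processed. The speedup is then the reciprocal of the retained fraction $r$:

$$\text{speedup} \approx \frac{1}{r}, \qquad \frac{1}{0.77} \approx 1.30, \qquad \frac{1}{0.70} \approx 1.43.$$

Under that assumption, both values agree with the reported $1.3\times$ and $1.4\times$ marks after rounding.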