Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-training

Mingliang Liang, Martha Larson

TL;DR

This work examines how text-masking strategies used during Vision-Language Model pre-training shape word-frequency distributions in the training data and how these shifts relate to model performance. It introduces CLIPF, a frequency-based masking approach that uses $P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$ to prioritize masking of frequent words without POS tagging or hard cutoffs, and demonstrates its advantages over syntax masking, especially as the number of input tokens decreases. The study also shows that the relative performance of masking strategies depends on training duration, with frequency-preserving methods like CLIPF performing well across multiple datasets (CC3M, CC12M, LAION-400M) and tasks (zero-shot classification, image-text retrieval). Overall, the findings offer practical guidance for choosing text-masking strategies under computational constraints and point to ongoing opportunities to tune word-frequency distributions for improved VLM pre-training efficiency and effectiveness.
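The masking rule quoted above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that $f(w_i)$ is the relative frequency of a word in the caption corpus, that words are obtained by whitespace splitting, and that negative probabilities (which arise when $f(w_i) < t$) are clipped to zero. The function names `masking_probabilities` and `mask_caption` are hypothetical.

```python
import math
import random
from collections import Counter


def masking_probabilities(captions, t=1e-6):
    """Compute P(w) = 1 - sqrt(t / f(w)) for every word in the corpus.

    Assumption: f(w) is the relative frequency of w over all captions.
    Assumption: values below zero (rare words with f(w) < t) are clipped to 0,
    so rare words are never masked.
    """
    counts = Counter(word for caption in captions for word in caption.lower().split())
    total = sum(counts.values())
    probs = {}
    for word, count in counts.items():
        freq = count / total
        probs[word] = max(0.0, 1.0 - math.sqrt(t / freq))
    return probs


def mask_caption(caption, probs, rng=random):
    """Drop each word independently with its masking probability.

    Words missing from `probs` are treated as unseen and always kept.
    """
    return [w for w in caption.lower().split() if rng.random() >= probs.get(w, 0.0)]


if __name__ == "__main__":
    corpus = ["a dog", "a cat", "a bird"]
    # A large t is used here only because the toy corpus is tiny.
    probs = masking_probabilities(corpus, t=0.02)
    for word, p in sorted(probs.items(), key=lambda item: -item[1]):
        print(f"{word}: {p:.3f}")
    print(mask_caption("a dog", probs, rng=random.Random(0)))
```

In this toy corpus the word "a" has relative frequency 0.5, so with $t = 0.02$ its masking probability is $1 - \sqrt{0.04} = 0.8$, while the rarer content words receive a lower probability. This shows how the rule favors masking frequent words without a POS tagger or a hard cutoff.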

Abstract

Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of strategies (truncation, random masking, block masking and syntax masking) and has reported syntax masking as the top performer. In this paper, we analyze the impact of different text masking strategies on the word frequency in the training data, and show that this impact is connected to model success. This finding motivates Contrastive Language-Image Pre-training with Word Frequency Masking (CLIPF), our proposed masking approach, which directly leverages word frequency. Extensive experiments demonstrate the advantages of CLIPF over syntax masking and other existing approaches, particularly when the number of input tokens decreases. We show that not only CLIPF, but also other existing masking strategies, outperform syntax masking when enough epochs are used during training, a finding of practical importance for selecting a text masking method for VLM training. Our code is available online.

Paper Structure

This paper contains 33 sections, 3 equations, 10 figures, 21 tables.

Figures (10)

  • Figure 1: Zero-shot classification accuracy on ImageNet-1K using models trained on CC12M to which different text masking strategies have been applied. The backbone of the image encoder is ViT-B/16. The text masking ratio is 75%. We also use image masking (75%) to speed up pre-training. At each point, we apply an additional epoch of training on the full, unmasked data.
  • Figure 2: Comparison of word masking probabilities for various methods. In this example, we keep four words for truncation and syntax. Numbers indicate the probability of a word being masked. Truncation keeps the first 4 words; Random and Block mask each word with a 50% probability; Syntax prioritizes retaining nouns, followed by adjectives, then others; Frequency masks words based on the frequency of the words.
  • Figure 3: Distribution of the top-25 words in the text before and after applying various text masking strategies. We set the text length after text masking to 6. The x-axis represents the word index, sorted by counts in the original data, and the y-axis shows the word frequency. The dataset used is CC12M, and the value of $t$ in Equation 2 is set to $10^{-6}$. We remove special characters from the vocabulary. A figure with more words is provided in the supplementary material.
  • Figure 4: Zero-shot classification accuracy on ImageNet-1K using models trained on LAION-400M to which different text masking strategies have been applied. The backbone of the image encoder is ViT-B/16. We pre-trained the model using 25 image tokens and 4 text tokens for 16 epochs. At each point, we apply an additional 0.4 epoch of training on the full, unmasked data.
  • Figure 5: The curve of Equation 2. The x-axis is the word frequency $f(w_i)$, and the y-axis is the masking probability $P(w_i)$. The value of $t$ in Equation 2 is set to $10^{-5}$, $10^{-6}$, $10^{-7}$, and $10^{-8}$.
  • ...and 5 more figures