Efficient Vision-Language Pre-training by Cluster Masking

Zihao Wei, Zixuan Pan, Andrew Owens

TL;DR

This work addresses inefficiency in vision-language pre-training caused by dense image data by introducing cluster-based masking of image patches. The method randomly selects anchor patches and forms clusters based on patch RGB and, optionally, embedding features, masking entire clusters to provide an extra context-based learning signal while reducing per-image data, thereby speeding training. Training uses symmetric CLIP-style losses to align image and text representations, and the approach is evaluated on CC12M with ViT-B/16, showing improvements over CLIP and FLIP across zero-shot retrieval, zero-shot classification, linear probing, and language composition benchmarks, plus favorable ablations on normalization and feature fusion. The results demonstrate that a simple, cluster-based masking strategy can yield stronger representations and practical gains in training efficiency for large-scale vision-language pre-training.
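The symmetric CLIP-style objective mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `symmetric_contrastive_loss`, the fixed `temperature=0.07`, and the use of plain NumPy rather than a deep-learning framework are assumptions made for illustration only.

```python
import numpy as np

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Illustrative CLIP-style loss for a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (B, D); row i of each is a matched pair.
    temperature is a hypothetical fixed value (often learned in practice).
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (B, B)

    def cross_entropy_on_diagonal(lg):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

Averaging the two directions is what makes the loss symmetric: swapping the image and text arguments leaves the value unchanged.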

Abstract

We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our model by pre-training on a number of benchmarks, finding that it outperforms other masking strategies, such as FLIP, on the quality of the learned representation.
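As a rough illustration of the masking step described above, the following sketch picks random anchor patches and masks every patch whose raw pixel values are close to an anchor. It is a simplified reading of the abstract, not the paper's algorithm: the function name `cluster_mask`, the cosine distance on normalized pixels, and the values of `anchor_ratio` and `threshold` are all assumptions.

```python
import numpy as np

def cluster_mask(image, patch_size=16, anchor_ratio=0.05, threshold=0.3, rng=None):
    """Return a boolean vector over patches (True = masked).

    image: float array of shape (H, W, 3) with H and W divisible by patch_size.
    anchor_ratio and threshold are hypothetical values chosen for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size

    # Split the image into non-overlapping patches, one flat RGB vector each.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    # Normalize so a dot product is a cosine similarity (an assumed metric).
    patches = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)

    n = gh * gw
    num_anchors = max(1, int(round(anchor_ratio * n)))
    anchors = rng.choice(n, size=num_anchors, replace=False)

    # Cosine distance from each anchor to every patch: shape (num_anchors, n).
    dist = 1.0 - patches[anchors] @ patches.T
    # A patch is masked if it lies within the threshold of any anchor.
    mask = (dist < threshold).any(axis=0)
    mask[anchors] = True  # anchors always belong to their own cluster
    return mask
```

Under this sketch, only the patches where the mask is `False` would be passed to the image encoder, which is where the reduction in per-image computation comes from.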

Paper Structure (34 sections, 2 equations, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: Cluster masking. We mask random clusters of visually similar image patches when training contrastive vision-language models (bottom). This masking strategy distinguishes our approach from methods such as FLIP that independently mask image patches for efficiency (middle), while providing a similar improvement in training speed. It provides an extra learning signal, since it forces a model to predict words for missing scene structures solely from context.
  • Figure 2: Choosing clusters. The process begins by randomly selecting anchor patches from the image. We then calculate the pairwise distances among all patches. Clusters formed within a distance threshold are masked out. We show the cluster obtained from a single anchor patch.
  • Figure 3: Visualization of cluster masks. Different colors represent distinct clusters formed by the similarity matrix calculated from the chosen anchor patches.
  • Figure 4: Generated caption from visible patches. We process the masked images through GPT-4 (OpenAI, 2023) to create captions for the unmasked segments.
  • Figure 5: Effect of anchor patch ratio. In every setting, the final masking ratio is tuned to 50%.
  • ...and 5 more figures