Efficient Vision-Language Pre-training by Cluster Masking
Zihao Wei, Zixuan Pan, Andrew Owens
TL;DR
This work addresses the inefficiency of vision-language pre-training on dense image data by masking clusters of image patches. The method randomly selects anchor patches and groups patches around them by similarity of their RGB values and, optionally, their embedding features. Masking entire clusters provides an extra context-based learning signal and reduces the number of patches processed per image, which speeds up training. Training uses symmetric CLIP-style losses to align image and text representations. The approach is evaluated by pre-training ViT-B/16 on CC12M, and it improves over CLIP and FLIP on zero-shot retrieval, zero-shot classification, linear probing, and language composition benchmarks; ablations cover normalization and feature fusion. The results indicate that a simple cluster-based masking strategy can yield stronger representations and practical gains in training efficiency for large-scale vision-language pre-training.
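As an illustration of the symmetric CLIP-style objective mentioned above, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `symmetric_contrastive_loss`, the default `temperature=0.07`, and the assumption that embeddings arrive already unit-normalized are all choices made for this example.

```python
import math


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy losses.

    image_emb, text_emb: lists of unit-normalized vectors, where row i of
    each list is a matched image-text pair. The temperature default is an
    assumption made for this sketch.
    """
    # Similarity logits: entry [i][j] compares image i with text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]

    def cross_entropy_on_diagonal(rows):
        # For each row, the correct class is the matching index (the diagonal).
        total = 0.0
        for i, row in enumerate(rows):
            top = max(row)  # subtract the max for numerical stability
            log_z = top + math.log(sum(math.exp(v - top) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    image_to_text = cross_entropy_on_diagonal(logits)
    # Transposing the logits gives the text-to-image direction.
    text_to_image = cross_entropy_on_diagonal([list(c) for c in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)
```

With matched pairs the loss is close to zero, and it grows when the pairing between images and texts is shuffled, which is the behavior the contrastive objective relies on.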
Abstract
We propose a simple strategy for masking image patches during vision-language contrastive learning that improves both the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our pre-trained model on a number of benchmarks, finding that it outperforms other masking strategies, such as FLIP, on the quality of the learned representation.
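The masking step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the names `cluster_mask`, `anchor_ratio`, and `sim_threshold` are hypothetical, similarity is taken to be cosine similarity between flattened RGB patch vectors, and the threshold is fixed rather than tuned to reach a target masking ratio.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cluster_mask(patches, anchor_ratio=0.05, sim_threshold=0.9, rng=None):
    """Return a list of booleans that is True where a patch is masked.

    patches: one flattened RGB vector (list of floats) per image patch.
    anchor_ratio, sim_threshold: hypothetical hyperparameters for this sketch.
    """
    rng = rng or random.Random()
    unit = [_normalize(p) for p in patches]
    n = len(unit)

    # Randomly pick a small set of anchor patches.
    num_anchors = max(1, round(anchor_ratio * n))
    anchors = rng.sample(range(n), num_anchors)

    masked = [False] * n
    for a in anchors:
        for i in range(n):
            # Cosine similarity between the anchor and patch i.
            sim = sum(x * y for x, y in zip(unit[a], unit[i]))
            # Mask the anchor itself and every patch similar enough to it.
            if i == a or sim >= sim_threshold:
                masked[i] = True
    return masked


# Usage: three red-like patches and three green-like patches, one anchor.
patches = [[1.0, 0.0, 0.0]] * 3 + [[0.0, 1.0, 0.0]] * 3
mask = cluster_mask(patches, anchor_ratio=0.2, rng=random.Random(0))
visible = [p for p, m in zip(patches, mask) if not m]
```

In this toy input, whichever patch is drawn as the anchor takes its whole color group with it, so only the other group stays visible. Only the visible patches would be passed to the image encoder, which is where the reduction in per-image computation comes from.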
