Semantic Compositions Enhance Vision-Language Contrastive Learning
Maxwell Aladago, Lorenzo Torresani, Soroush Vosoughi
TL;DR
The paper addresses data efficiency in vision-language pretraining by introducing CLIP-$\mathcal{C}$, a simple method that creates semantically composite image-caption pairs during CLIP pretraining. Composites are sampled at rate $\rho$: the two captions are joined with "and", and the center-half crops of the two corresponding images are combined into one image. This provides richer semantic supervision without extra parameters or computational cost. Across CC3M, CC12M, and RedCaps, the approach yields notable gains in zero-shot image classification and cross-modal retrieval, with especially strong benefits in low-data settings, while maintaining competitive linear probing performance. Extensive ablations indicate that the semantic content of the compositions, dynamic sampling, and the center-half image composition are key to the improvements, whereas alternative augmentation strategies and fixed pairings are less effective. Overall, CLIP-$\mathcal{C}$ offers a scalable, data-efficient enhancement for vision-language models and is particularly promising for domains with limited paired data.
Abstract
In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
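The composition step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`center_half_crop`, `compose_pair`, `sample_example`) are hypothetical, images are represented as nested lists of pixel values to keep the example dependency-free, and the crop is assumed to be taken along the width with the two halves placed side by side. The summary only states that center-half crops are combined, captions are joined with "and", and composites are sampled at rate $\rho$.

```python
import random


def center_half_crop(image):
    """Keep the middle half of the columns of an H x W image (a list of rows).

    Assumption: the crop is taken along the width; the source does not say
    which axis is used.
    """
    width = len(image[0])
    start = width // 4
    return [row[start:start + width // 2] for row in image]


def compose_pair(image_a, caption_a, image_b, caption_b):
    """Build one composite example from two image-caption pairs."""
    left = center_half_crop(image_a)
    right = center_half_crop(image_b)
    # Place the two half-width crops side by side (assumed layout).
    image = [row_a + row_b for row_a, row_b in zip(left, right)]
    # Join the captions with "and", as described in the summary.
    caption = f"{caption_a} and {caption_b}"
    return image, caption


def sample_example(dataset, index, rho, rng=random):
    """Return a composite with probability rho, else the plain example.

    `dataset` is a sequence of (image, caption) tuples with at least two
    entries. The partner is drawn uniformly from the other examples, which
    is an assumption about the sampling scheme.
    """
    image, caption = dataset[index]
    if rng.random() < rho:
        other = rng.randrange(len(dataset) - 1)
        if other >= index:
            other += 1  # skip the example itself
        other_image, other_caption = dataset[other]
        return compose_pair(image, caption, other_image, other_caption)
    return image, caption
```

Because the partner is redrawn each time an example is visited, the same image can be paired with different partners across epochs, which is one way to read the "dynamic sampling" mentioned in the TL;DR. The composite keeps the original image width, so it can be fed to the image encoder like any other example.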
