Semantic Compositions Enhance Vision-Language Contrastive Learning

Maxwell Aladago, Lorenzo Torresani, Soroush Vosoughi

TL;DR

The paper addresses data efficiency in vision-language pretraining by introducing CLIP-$\mathcal{C}$, a simple method that creates semantically composite image-caption pairs during CLIP pretraining. Each composite concatenates two captions with "and" and joins the center-half crops of the two corresponding images; composites are sampled at rate $\rho$, so only a fraction of each mini-batch is composite. CLIP-$\mathcal{C}$ provides richer semantic supervision without extra parameters or computational cost. Across CC3M, CC12M, and RedCaps, it yields notable gains in zero-shot image classification and cross-modal retrieval, with especially strong benefits in low-data settings, while maintaining competitive linear probing performance. Extensive ablations indicate that the semantic content of the compositions, dynamic sampling, and the center-half image composition are key to the improvements, whereas alternative augmentation strategies and fixed pairings are less effective. Overall, CLIP-$\mathcal{C}$ offers a scalable, data-efficient enhancement for vision-language models and is particularly promising for domains with limited paired data.

Abstract

In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
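The composition step described above can be sketched for a single pair as follows. This is a minimal illustration rather than the authors' implementation: the function names `center_half` and `compose_pair` are hypothetical, both images are assumed to be already resized to the same shape with even height and width, and keeping the image order fixed while randomizing only the caption order is an assumption.

```python
import random
import numpy as np

def center_half(img, axis):
    """Return the central half of `img` along `axis` (0 = height, 1 = width)."""
    n = img.shape[axis]
    start = n // 4
    return np.take(img, np.arange(start, start + n // 2), axis=axis)

def compose_pair(img1, cap1, img2, cap2, rng):
    """Build one composite image-caption pair from two examples (hypothetical helper)."""
    # axis=0: each crop spans the full width; axis=1: each crop spans the full height.
    axis = rng.choice([0, 1])
    # Joining the two half-size crops along the cropped axis restores the input shape.
    image = np.concatenate(
        [center_half(img1, axis), center_half(img2, axis)], axis=axis
    )
    # Captions are joined with "and"; which caption comes first is randomized.
    if rng.random() < 0.5:
        caption = f"{cap1} and {cap2}"
    else:
        caption = f"{cap2} and {cap1}"
    return image, caption
```

Because each crop keeps half of its source image, the composite has the same resolution as a plain example, which is consistent with the claim that the method adds no computational overhead to the encoders.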

Paper Structure (24 sections, 3 equations, 6 figures, 14 tables)

Figures (6)

  • Figure 1: CLIP-$\mathcal{C}$: We use the center half crops spanning the width (as in this illustration) or the height of the image. The captions are concatenated with the delimiter "and". We vary the positions of the captions on either side of the conjunction, i.e., the output caption can be either (a) $\{\text{caption1 } \text{and } \text{caption2}\}$ or (b) $\{\text{caption2 } \text{and } \text{caption1}\}$. We emphasize that only a fraction of the batch in each iteration consists of composite samples. The colored boxes and texts shown here are for illustrative purposes.
  • Figure 2: Counter-intuitively, the model learns to match the composite examples faster compared to the plain instances.
  • Figure 3: CLIP-$\mathcal{C}$ generally produces higher cosine similarity for matching pairs than CLIP.
  • Figure 4: CLIP-$\mathcal{C}$ vs. CLIP. Pretraining CLIP longer than CLIP-$\mathcal{C}$ does not close the performance gap. The advantage of CLIP-$\mathcal{C}$ grows as training duration increases.
  • Figure 5: Sampling probability $\rho$. Our method is very effective when between 10% and 50% of the mini-batch are CLIP-$\mathcal{C}$ compositions but performs poorly when the entire batch is composite instances.
  • ...and 1 more figures
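
The role of the sampling probability $\rho$ mentioned in the captions of Figures 1 and 5 can be illustrated with a short sketch. It assumes that each slot in a mini-batch is independently turned into a composite with probability $\rho$ and that the partner is drawn from the same mini-batch; both are assumptions made for illustration, and `plan_batch` is a hypothetical helper, not part of any released code.

```python
import random

def plan_batch(batch_size, rho, rng):
    """For each slot i, return None (keep the plain example) or a partner index j != i."""
    plan = []
    for i in range(batch_size):
        if batch_size > 1 and rng.random() < rho:
            # Draw a partner uniformly from the other batch_size - 1 slots.
            j = rng.randrange(batch_size - 1)
            plan.append(j if j < i else j + 1)  # shift past i so that j != i
        else:
            plan.append(None)
    return plan

rng = random.Random(0)
plan = plan_batch(1000, 0.3, rng)
composite_fraction = sum(p is not None for p in plan) / len(plan)
```

With $\rho = 0.3$, roughly 30% of the slots become composites, which falls inside the 10% to 50% range that Figure 5 reports as effective. Because partners are redrawn at every call, the pairings change from iteration to iteration instead of being fixed.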