Contrastive vision-language learning with paraphrasing and negation

Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, Artur d'Avila Garcez

TL;DR

The paper addresses the semantic fragility of vision-language models under paraphrase and negation by introducing SemCLIP, which adds a projection-based semantic space and two losses to the standard CLIP objective. The total loss is $L_{total} = (\alpha L_{contrastive} + \beta L_{paraphrase} + \gamma L_{negation}) / (\alpha + \beta + \gamma)$, with $L_{paraphrase} = 1 - \cos(p(t), p(t^+))$ and $L_{negation} = \max(0, \cos(p(t), p(t^-)))$, where $p(t) = V^\top t$ and $V$ contains $n$ orthonormal projection vectors. Caption augmentation generates a paraphrased caption $c^+$ and a negated caption $c^-$ for each image-caption pair; the corresponding text embeddings $t$, $t^+$, $t^-$ are normalized and projected onto a low-dimensional subspace that encodes the semantic relations. Empirically, SemCLIP preserves CLIP's performance on original captions and improves original-versus-negation robustness, reaching 78.1% original-over-negation accuracy on CC-Neg (vs. 68.1% for the baseline). Results on Sugarcrepe++ are mixed relative to CLIP but generally better than those of models trained with negated captions, and SemCLIP pre-trained on Sugarcrepe++ is more robust to negation than CLIP on all tested downstream zero-shot classification tasks. These findings indicate that jointly modeling paraphrase and negation in a projection space yields robustness to semantic transformations, with practical benefits for retrieval and downstream tasks.
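The loss terms above can be sketched in a few lines of plain Python. This is a minimal illustration of the formulas as stated in the TL;DR, not the authors' implementation: the function names, the default weights of 1.0, the small `eps` guard against zero-length projections, and the representation of V as a list of projection vectors are all assumptions made here. The contrastive term is passed in as a precomputed number, since the standard CLIP objective is not reproduced.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(V, t):
    """p(t) = V^T t, with V stored as a list of n projection vectors."""
    return [sum(vi * ti for vi, ti in zip(v, t)) for v in V]

def cosine(a, b, eps=1e-12):
    """Cosine similarity; eps (an assumption) avoids division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / max(na * nb, eps)

def paraphrase_loss(V, t, t_pos):
    """L_paraphrase = 1 - cos(p(t), p(t+)): zero when projections align."""
    return 1.0 - cosine(project(V, t), project(V, t_pos))

def negation_loss(V, t, t_neg):
    """L_negation = max(0, cos(p(t), p(t-))): penalizes positive similarity only."""
    return max(0.0, cosine(project(V, t), project(V, t_neg)))

def total_loss(l_contrastive, l_paraphrase, l_negation,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted average of the three terms; default weights are illustrative."""
    weighted = alpha * l_contrastive + beta * l_paraphrase + gamma * l_negation
    return weighted / (alpha + beta + gamma)

# Toy example: two orthonormal projection vectors in a 3-D embedding space.
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
t = normalize([1.0, 1.0, 0.0])        # original caption embedding
t_pos = normalize([2.0, 2.0, 1.0])    # paraphrase: same direction after projection
t_neg = normalize([-1.0, -1.0, 0.5])  # negation: opposite direction after projection
```

In this toy setup the paraphrase loss is zero because the projected paraphrase points the same way as the projected original, and the negation loss is zero because the projected negation points the opposite way. Because of the `max(0, ...)` clamp, the negation term stops contributing once the projected similarity is no longer positive.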

Abstract

Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to preserve CLIP's performance while considerably increasing the distances to negated captions. On the CC-Neg benchmark, using an original-over-negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than that of the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks, where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
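The abstract does not define the original-over-negation accuracy metric. One plausible reading, assumed here, is the fraction of images whose similarity to the original caption exceeds their similarity to the negated caption. The sketch below follows that reading; the function name and the similarity scores are hypothetical and are not taken from the paper.

```python
def original_over_negation_accuracy(sim_original, sim_negated):
    """Fraction of images scored higher with the original caption than with
    the negated one. Ties count as failures (an assumption of this sketch)."""
    if len(sim_original) != len(sim_negated):
        raise ValueError("score lists must have the same length")
    if not sim_original:
        raise ValueError("at least one image is required")
    wins = sum(1 for s_orig, s_neg in zip(sim_original, sim_negated)
               if s_orig > s_neg)
    return wins / len(sim_original)

# Made-up image-caption similarity scores for four images.
sim_original = [0.31, 0.28, 0.40, 0.22]
sim_negated = [0.25, 0.30, 0.10, 0.22]
accuracy = original_over_negation_accuracy(sim_original, sim_negated)
```

Under this reading, a model that cannot distinguish a caption from its negation would score near 50%, which gives a reference point for the reported 68.1% and 78.1% figures.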

Paper Structure

This paper contains 13 sections, 8 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: SemCLIP architecture including the contrastive loss $L_{contrastive}$, paraphrasing loss $L_{paraphrase}$ and negation loss $L_{negation}$, showing how the embeddings of image-caption pairs are aligned through the proposed embedding projection and new training loss function (the total training loss equation).
  • Figure 2: Model robustness to negation is measured by the difference in accuracy (Mean Accuracy Delta) between positive and negative captions, e.g. "This is not a photo of a <class>." Any negative Delta is shown as zero. SemCLIP achieves the highest Delta on three of five downstream datasets when finetuned on CC-Neg (a) and on all datasets when finetuned on Sugarcrepe++ (SCPP) (b); further detailed evaluations are reported in Appendix C.
  • Figure 3: Effect of the number of projection vectors on image-matching accuracy for the trained model (finetuned on the CC-Neg dataset; Singh et al., 2024).
  • Figure 4: Effect of using learnable projection vectors on image-matching accuracy for the trained model (finetuned on the CC-Neg dataset; Singh et al., 2024).
  • Figure 5: Effect of projection-vector normalization on image-matching accuracy for the trained model (finetuned on the CC-Neg dataset; Singh et al., 2024).
  • ...and 5 more figures