CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models

Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea

TL;DR

This work addresses the gap in compositional language understanding in vision-language models by introducing CLoVe, a framework that augments CLIP-like two-tower VLMs with synthetic captions, hard negatives, and model patching. The approach leverages large-scale synthetic data from LAION-COCO, targeted hard negatives, and a weight-space patching strategy to combine improved compositionality with preserved object-recognition and retrieval performance. Empirical results on compositional benchmarks (e.g., SugarCrepe, ARO) show an absolute gain of over 10%, with minimal degradation on standard tasks such as ImageNet, and the method is demonstrated through a CLIP case study with detailed ablations. The work provides practical tools (code and checkpoints) for enhancing language composition in CLIP-like models, offering a scalable path toward more robust visual reasoning and controlled image-language generation, while acknowledging limitations and future directions for broader model families and fairness analysis.

Abstract

Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.

Paper Structure

This paper contains 21 sections, 1 equation, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: Our proposed framework CLoVe significantly improves the compositionality performance (as measured by an average of SugarCrepe's seven fine-grained tasks) of pre-trained CLIP-like models while preserving their performance on other downstream tasks (as measured by ImageNet). Comparisons with more benchmarks are presented in the compositional-benchmark and common-benchmark results tables. Baselines: REPLACE (SugarCrepe) and NegCLIP (ARO).
  • Figure 2: Our CLoVe framework consists of three steps. First, obtain synthetic captions for a large image dataset. Second, fine-tune a pre-trained Contrastive VLM on it along with hard negative texts. Third, patch the original model with the fine-tuned one.
  • Figure 3: The effect of applying model patching to both an object-centric benchmark (ImageNet; x-axis) and a compositionality benchmark (ARO; the four y-axes represent its four tasks), when varying the value of the weight in the average, $\alpha$. The value of $\alpha$ varies from 0 (the pre-trained model) to 1 (the fine-tuned model) in 0.05 increments, and the lines connect such points. We can obtain models with good zero-shot performance in ImageNet and compositionality when $\alpha$ is around 0.4 to 0.7. Note the four y-axes were adjusted to make the pre-trained and fine-tuned model points match to focus on how the lines vary between them.
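
The patching step described in Figures 2 and 3 can be illustrated with a minimal sketch. It assumes the weighted average is a plain per-parameter linear interpolation, (1 - α) · pre-trained + α · fine-tuned, which matches the endpoints given in the Figure 3 caption (α = 0 is the pre-trained model, α = 1 the fine-tuned one). The names `patch_weights` and `alpha_grid` and the dictionary-of-lists weight format are hypothetical and chosen for illustration only; they are not taken from the CLoVe code base.

```python
from typing import Dict, List

# Hypothetical weight format: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def patch_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * pretrained + alpha * finetuned for every parameter.

    Assumption: both models share the same architecture, so their parameter
    names and sizes match exactly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")

    patched: Weights = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"size mismatch for parameter {name!r}")
        patched[name] = [(1.0 - alpha) * p + alpha * f for p, f in zip(w_pre, w_ft)]
    return patched


def alpha_grid(steps: int = 20) -> List[float]:
    """Values of alpha from 0 to 1 inclusive; 20 steps gives 0.05 increments."""
    return [i / steps for i in range(steps + 1)]


if __name__ == "__main__":
    # Toy weights standing in for the two checkpoints.
    pre = {"text.proj": [1.0, 2.0], "visual.proj": [0.0, 4.0]}
    ft = {"text.proj": [3.0, 2.0], "visual.proj": [2.0, 0.0]}
    for a in alpha_grid():
        model = patch_weights(pre, ft, a)
        # In practice each patched model would be evaluated on both benchmarks.
        print(f"alpha={a:.2f}", model)
```

Sweeping `alpha_grid()` and evaluating each patched model on an object-centric benchmark and a compositionality benchmark would reproduce the kind of trade-off curve that Figure 3 describes. With real checkpoints the same interpolation would be applied to each tensor in the models' state dictionaries.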