CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea
TL;DR
This work addresses the weak compositional language understanding of vision-language models by introducing CLoVe, a framework that augments CLIP-like two-tower VLMs with synthetic captions, hard negatives, and model patching. The approach combines large-scale synthetic captions from LAION-COCO, targeted hard negatives, and a weight-space patching step, so that the gains in compositionality do not come at the cost of object-recognition and retrieval performance. On compositional benchmarks (e.g., SugarCrepe, ARO) the method yields over 10% absolute improvement, with minimal degradation on standard tasks such as ImageNet. It is demonstrated through a CLIP case study with detailed ablations. The authors release code and checkpoints for improving language composition in CLIP-like models, which offers a scalable path toward more robust visual reasoning and controlled image-language generation. They also acknowledge limitations and outline future directions, including broader model families and fairness analysis.
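To make the weight-space patching step concrete, here is a minimal sketch. It assumes that patching amounts to a linear interpolation between the pretrained weights and the weights fine-tuned for compositionality. The function name `patch_weights`, the mixing coefficient `alpha`, and the toy parameter name `text.proj` are hypothetical and are not taken from the released code. Plain Python lists stand in for model tensors.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def patch_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Linearly interpolate two sets of weights (assumed form of patching).

    alpha = 0.0 returns the pretrained weights, alpha = 1.0 returns the
    fine-tuned weights, and values in between blend the two.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")

    patched: Weights = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        if len(base) != len(tuned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise blend of the two parameter vectors.
        patched[name] = [(1.0 - alpha) * b + alpha * t for b, t in zip(base, tuned)]
    return patched


# Hypothetical toy weights, for illustration only.
pretrained = {"text.proj": [0.0, 2.0]}
finetuned = {"text.proj": [1.0, 4.0]}
print(patch_weights(pretrained, finetuned, alpha=0.5))
```

In a setup like this, `alpha` would typically be chosen on held-out data to balance compositionality scores against standard benchmarks such as ImageNet.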
Abstract
Recent years have witnessed a significant increase in performance on Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and have demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. Moreover, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.
