Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Le Zhang, Rabiul Awal, Aishwarya Agrawal

TL;DR

The paper addresses the limited compositional reasoning of vision-language models (VLMs) by extending the standard image-text contrastive objective with two annotation-free losses that operate on automatically generated hard negatives. It introduces an intra-modal contrastive loss that sharpens the distinction between a caption and its hard negatives, and a cross-modal rank loss whose adaptive, type-specific thresholds act as a form of curriculum learning during fine-tuning. Applied to CLIP and XVLM, the method yields state-of-the-art gains across five vision-language compositional benchmarks (including ARO, VALSE, VL-CheckList, and SugarCrepe), with substantial improvements on relation and attribute reasoning, while preserving performance on standard retrieval tasks. The approach is scalable and compatible with existing pretrained VLMs, offering a practical path to better fine-grained image-text grounding without extra annotations or parameters.

Abstract

Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at https://github.com/lezhang7/Enhance-FineGrained.
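
To make the "bag-of-words" concern concrete, here is a minimal illustrative sketch. It is not taken from the paper or its code: the word-count encoder `bow_embed`, the helper `cosine`, and the two example captions are all assumptions chosen for illustration. It shows that an encoder which ignores word order assigns identical embeddings to a caption and to a hard negative built by swapping two of its words, so no similarity-based objective could tell them apart.

```python
from collections import Counter
import math


def bow_embed(caption, vocab):
    """Hypothetical order-insensitive text encoder: one word count per vocab entry."""
    counts = Counter(caption.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# A caption and a hard negative that swaps the two nouns (illustrative examples).
positive = "the horse is eating the grass"
hard_negative = "the grass is eating the horse"

vocab = sorted(set(positive.split()) | set(hard_negative.split()))
pos_vec = bow_embed(positive, vocab)
neg_vec = bow_embed(hard_negative, vocab)

# Both captions contain the same words, so the two embeddings coincide.
text_sim = cosine(pos_vec, neg_vec)
print(pos_vec == neg_vec, round(text_sim, 6))
```

A real VLM text encoder is not literally a word counter; the sketch only illustrates the failure mode that hard negatives of this kind are meant to expose.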

Paper Structure (28 sections, 9 equations, 11 figures, 6 tables)

This paper contains 28 sections, 9 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Models trained with standard image-text contrastive learning lack sufficient compositional reasoning abilities. Our method teaches the model to better differentiate between similar captions and learn fine-grained alignment between images and text to improve compositional reasoning.
  • Figure 2: (Top) An overview of our method's pipeline and hard negative generation examples. Losses are applied on the shaded boxes.
  • Figure 3: Conceptual loss comparison. Red arrows denote minimizing similarity, while green arrows denote maximizing it; dotted arrows represent data augmentation. (a) Standard image-text contrastive learning as applied in CLIP (Radford et al., 2021). (b) Proposed intra-modal contrast applied on generated hard negative texts. (c) Proposed cross-modal rank applied on positive and hard negative pairs with an adaptive threshold.
  • Figure 3: Results (%) on SugarCrepe. Vera and Grammar are text-only models.
  • Figure 4: Ablations on hard-negative types
  • ...and 6 more figures
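
The two proposed objectives named in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the excerpt does not reproduce the paper's equations, so the log-sum-exp form of `intra_modal_contrast`, the hinge form of `cross_modal_rank`, the clipped mean-gap rule in `update_threshold`, and the values `temperature=0.07` and `upper_bound=0.2` are all assumptions made for this example.

```python
import math


def dot(u, v):
    """Similarity score; embeddings are assumed to be L2-normalized."""
    return sum(a * b for a, b in zip(u, v))


def intra_modal_contrast(text, hard_negs, temperature=0.07):
    """Assumed form: log-sum-exp of text-to-hard-negative similarities.

    The value shrinks as the caption moves away from its hard negatives.
    """
    return math.log(sum(math.exp(dot(text, hn) / temperature) for hn in hard_negs))


def cross_modal_rank(image, text, hard_negs, thresholds):
    """Assumed hinge form: the true caption should beat each hard negative
    by at least that negative type's threshold."""
    pos = dot(image, text)
    return sum(max(0.0, dot(image, hn) - pos + th)
               for hn, th in zip(hard_negs, thresholds))


def update_threshold(pos_sims, neg_sims, upper_bound=0.2):
    """Assumed adaptive rule: mean observed gap, clipped to [0, upper_bound]."""
    mean_gap = sum(p - n for p, n in zip(pos_sims, neg_sims)) / len(pos_sims)
    return min(max(mean_gap, 0.0), upper_bound)


# Toy 2-D unit vectors (illustrative values only).
image = [1.0, 0.0]
text = [0.8, 0.6]
hard_negs = [[0.6, 0.8], [0.0, 1.0]]  # one hard negative per negative type

loose = cross_modal_rank(image, text, hard_negs, thresholds=[0.1, 0.1])
strict = cross_modal_rank(image, text, hard_negs, thresholds=[0.3, 0.3])
next_threshold = update_threshold([0.8, 0.8], [0.6, 0.0])
```

Under these assumptions a small threshold is already satisfied by the toy embeddings, while a larger one produces a positive penalty. Raising the threshold as the observed gap grows is one way to read the curriculum effect described in the TL;DR.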