Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim

TL;DR

This work tackles the trade-off between vision-language compositionality and multi-modal performance in pre-trained VLMs. It introduces FSC-CLIP, which combines a Local Hard Negative (LHN) Loss, computed from dense patch-token alignments, with Selective Calibrated Regularization (SCR), which tempers hard-negative supervision. The training objective is $L_{total}=L_{clip}+\lambda_g L_{neg}^g+\lambda_l L_{neg}^l$, where $L_{neg}^g$ and $L_{neg}^l$ are the global and local hard negative losses, weighted by $\lambda_g$ and $\lambda_l$. Evaluated across 11 compositionality benchmarks, the approach reaches compositionality scores on par with the state of the art while preserving zero-shot classification and image-text retrieval capabilities, with additional gains when using LoRA fine-tuning. The results indicate that fine-grained local supervision combined with calibrated regularization can improve compositional reasoning without sacrificing multi-modal representations. Code is publicly available.
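As a minimal sketch of how the combined objective in the TL;DR adds up, the snippet below computes $L_{total}$ from three scalar loss values. The function name `total_loss`, the argument names, and all numeric values are hypothetical placeholders chosen for illustration; they are not taken from the paper or its code.

```python
def total_loss(l_clip, l_neg_global, l_neg_local, lambda_g, lambda_l):
    """Weighted sum L_total = L_clip + lambda_g * L_neg^g + lambda_l * L_neg^l.

    All arguments are plain floats here; in a real training loop they would
    be scalar tensors produced by the respective loss functions.
    """
    return l_clip + lambda_g * l_neg_global + lambda_l * l_neg_local


# Placeholder values only (assumptions, not the paper's hyperparameters).
loss = total_loss(
    l_clip=1.0,
    l_neg_global=0.5,
    l_neg_local=0.25,
    lambda_g=0.5,
    lambda_l=0.2,
)
print(loss)
```

Setting `lambda_g` and `lambda_l` to zero recovers the plain CLIP loss, which makes the two weights a direct knob for how strongly hard-negative supervision influences fine-tuning.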

Abstract

In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.

Paper Structure (21 sections, 11 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: A holistic comparison of fine-tuning methods for vision-language compositionality. Enhancing compositionality often compromises multi-modal task performance in previous approaches. Our FSC-CLIP bridges this gap, minimizing these trade-offs. Full experimental results are provided in \ref{tab:method_comparison}.
  • Figure 2: A complete FSC-CLIP framework that incorporates Local Hard Negative (LHN) Loss with Selective Calibrated Regularization (SCR), alongside a global HN loss. The LHN loss measures similarity between an image and a text at the patch and token levels to more accurately identify subtle differences between original and HN texts. SCR combines focal loss with label smoothing to mitigate the adverse effects of using hard negative losses.
  • Figure 3: A conceptual illustration of the confidence-based weighting mechanism in HN loss. It reduces the adverse impact of HN supervision by lowering the signal from confident predictions while selectively focusing on challenging ones, crucial for learning compositionality.
  • Figure 4: Fine-tuning trajectories between compositionality (Comp) and zero-shot classification (ZS) via the robust fine-tuning method of Wortsman et al. (2022). Each point represents a model interpolated between the pre-trained model and a fine-tuned version, at varying ratios. FSC-CLIP offers better trade-offs between Comp and ZS, maintaining ZS scores in the fully fine-tuned model.
  • Figure 5: Image-to-text retrieval examples on the COCO-CF dataset. CLIP and DAC-LLM often rank negative captions (marked with red crossmarks) as top-1, while FSC-CLIP consistently retrieves the correct caption (marked with green checkmarks), demonstrating superior fine-grained understanding and retrieval accuracy in challenging conditions.
  • ...and 1 more figure
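The captions for Figures 2 and 3 describe SCR as combining focal loss with label smoothing, so that confident predictions contribute less to the hard negative loss. The following is a rough sketch of that idea for a single image, assuming a softmax over the similarity to the original caption and to its hard negative captions. The exact formulation used by FSC-CLIP is not given in this section, so the function name `focal_smoothed_hn_loss`, the target construction, and the default `gamma` and `eps` values are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_smoothed_hn_loss(logits, gamma=2.0, eps=0.1):
    """Hypothetical hard negative loss with focal weighting and label smoothing.

    logits[0]  : similarity between the image and the original caption
    logits[1:] : similarities between the image and hard negative captions
    gamma      : focal exponent; larger values down-weight confident cases more
    eps        : label-smoothing mass spread evenly over the hard negatives
    """
    if len(logits) < 2:
        raise ValueError("need one positive and at least one hard negative")
    probs = softmax(logits)
    num_neg = len(logits) - 1
    # Smoothed targets: most mass on the original caption, a little on negatives.
    targets = [1.0 - eps] + [eps / num_neg] * num_neg
    # Cross-entropy against the smoothed targets (clamped to avoid log(0)).
    ce = -sum(t * math.log(max(p, 1e-12)) for t, p in zip(targets, probs) if t > 0)
    # Focal weight: close to 0 when the model is already confident.
    weight = (1.0 - probs[0]) ** gamma
    return weight * ce


confident = focal_smoothed_hn_loss([10.0, 0.0])  # original caption clearly preferred
uncertain = focal_smoothed_hn_loss([0.0, 0.0])   # model cannot tell them apart
print(confident < uncertain)
```

With `gamma=0.0` and `eps=0.0` the function reduces to ordinary cross-entropy, so the two parameters isolate the contribution of each ingredient: the focal term selects which examples matter, and the smoothing term softens the target for captions that are nearly identical to the original.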