Extract Free Dense Misalignment from CLIP

JeongYeon Nam, Jinbae Im, Wonjae Kim, Taeho Kil

TL;DR

This work introduces CLIP4DM, a zero-shot method that detects dense, word-level misalignments between images and text by extracting token-level attributions from pre-trained CLIP. Its key changes are to let negative gradients flow through gradient-based explanations, to aggregate attributions across layers, and to combine them with a global score to form F-CLIPScore, a local-global misalignment metric. On FOIL, nocaps-FOIL, HAT, SeeTRUE-Feedback, and Rich-HF, CLIP4DM achieves state-of-the-art zero-shot results and performance competitive with fine-tuned models while being more efficient. The approach is strong at identifying entity-level, intangible, and attribute-based misalignments, but it also inherits some CLIP biases and struggles with background objects, which points to CLIP variants and longer-context modeling as directions for future improvement.
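The signed-attribution idea can be illustrated with a toy example. The sketch below assumes a simplified attention-times-gradient rule summed over layers, in the spirit of common transformer explainability methods; it is not CLIP4DM's actual computation, and the function name `token_attributions` and all numbers are made up for illustration. It only shows why dropping the usual positive clamp matters: with the clamp, a token that pushes the similarity down looks merely neutral.

```python
# Toy sketch (assumption): per-token attribution = sum over layers of
# attention * gradient. Not the paper's exact rule; inputs are made-up numbers.

def token_attributions(attn_layers, grad_layers, keep_negative=True):
    """Sum per-token (attention * gradient) contributions over layers.

    attn_layers[l][i]: attention from the sentence-summary token to token i.
    grad_layers[l][i]: gradient of the image-text similarity w.r.t. that weight.
    keep_negative=False mimics the common positive clamp.
    """
    n_tokens = len(attn_layers[0])
    totals = [0.0] * n_tokens
    for attn, grad in zip(attn_layers, grad_layers):
        for i in range(n_tokens):
            contribution = attn[i] * grad[i]
            if not keep_negative:
                contribution = max(contribution, 0.0)
            totals[i] += contribution
    return totals


# Hypothetical caption whose second word does not match the image.
tokens = ["a", "dog", "on", "a", "sofa"]
attn_layers = [
    [0.10, 0.40, 0.10, 0.10, 0.30],
    [0.05, 0.50, 0.05, 0.05, 0.35],
]
grad_layers = [
    [0.2, -1.0, 0.1, 0.2, 0.8],
    [0.1, -0.6, 0.2, 0.1, 0.5],
]

signed = token_attributions(attn_layers, grad_layers, keep_negative=True)
clamped = token_attributions(attn_layers, grad_layers, keep_negative=False)

for tok, s, c in zip(tokens, signed, clamped):
    print(f"{tok:>5}  signed={s:+.3f}  clamped={c:+.3f}")
```

In the signed version the token "dog" receives a clearly negative score, while the clamped version reports zero for it, so the misalignment signal is lost.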

Abstract

Recent vision-language foundation models still frequently produce outputs misaligned with their inputs, as evidenced by object hallucination in captioning and prompt misalignment in text-to-image generation. Recent studies have explored methods for identifying misaligned elements, aiming not only to enhance interpretability but also to improve model performance. However, current approaches primarily rely on large foundation models in a zero-shot manner or on models fine-tuned with human annotations, which limits scalability due to significant computational costs. This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP, specifically focusing on pinpointing misaligned words between image and text. We carefully revamp the gradient-based attribution computation method, enabling the negative gradients of individual text tokens to indicate misalignment. We also propose F-CLIPScore, which aggregates misaligned attributions with a global alignment score. We evaluate our method on a range of dense misalignment detection benchmarks covering diverse image and text domains and misalignment types. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models while maintaining superior efficiency. Our qualitative examples show that our method is uniquely able to detect entity-level objects, intangible objects, and attributes that existing methods cannot easily detect. We conduct ablation studies and analyses to highlight the strengths and limitations of our approach. Our code is publicly available at https://github.com/naver-ai/CLIP4DM.
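To make the local-global idea concrete, here is a minimal sketch of how per-token attributions might be turned into flagged words and a single score. The `clip_score` helper follows the standard CLIPScore definition (2.5 times the cosine similarity, floored at zero). Everything else is an assumption for illustration: the abstract does not give the F-CLIPScore formula, so the threshold, the additive combination, and the names `misaligned_tokens` and `local_global_score` are hypothetical, and the attribution values are invented.

```python
# Illustrative sketch only. The combination rule below is an assumption,
# not the paper's F-CLIPScore definition.

def clip_score(cosine, w=2.5):
    """Standard CLIPScore: w * max(cosine similarity, 0), with w = 2.5."""
    return w * max(cosine, 0.0)


def misaligned_tokens(tokens, attributions, threshold=0.0):
    """Flag tokens whose attribution falls below the threshold (assumed rule)."""
    return [tok for tok, a in zip(tokens, attributions) if a < threshold]


def local_global_score(cosine, attributions, threshold=0.0):
    """Hypothetical local-global score: the global CLIPScore plus the sum of
    below-threshold (negative) token attributions as a penalty."""
    penalty = sum(a for a in attributions if a < threshold)
    return clip_score(cosine) + penalty


# Made-up attributions for a caption whose second word does not match the image.
tokens = ["a", "dog", "on", "a", "sofa"]
attributions = [0.025, -0.7, 0.02, 0.025, 0.415]

flagged = misaligned_tokens(tokens, attributions)
score = local_global_score(0.30, attributions)

print(flagged)  # ['dog']
print(round(score, 3))
```

The design point is that a caption with one clearly wrong word can still have a moderately high global similarity; folding the negative attributions into the score lowers it in proportion to how strongly individual words disagree with the image.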

Paper Structure

This paper contains 39 sections, 11 equations, 17 figures, and 14 tables.

Figures (17)

  • Figure 1: Overview of our work. CLIPScore expresses the alignment between the image and text as a single scalar, which limits how the score can be interpreted. Our approach extracts both positive and negative attributions to identify misaligned tokens between the image and text caption.
  • Figure 2: Qualitative examples on FOIL, nocaps-FOIL, and Rich-HF datasets. Misaligned words are highlighted in red in captions paired with images. Note that misaligned words may not exist. For predicted misaligned words, correct words are shown in green and incorrect words in red. If our model predicts that there are no misaligned words, it is indicated as '-'.
  • Figure 3: Ablation on the number of text encoder layers used for attribution calculation on nocaps-FOIL dataset.
  • Figure 4: Qualitative examples compared to ALOHa on HAT dataset. Our method demonstrates improved robustness in various misalignment types.
  • Figure 5: Qualitative examples compared to MiniGPT-v2 on SeeTRUE-Feedback dataset. For most examples, MiniGPT-v2 generates lengthy, unstructured responses that are hard to parse into misaligned words.
  • ...and 12 more figures