Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh

TL;DR

The paper tackles image–text alignment and compositional understanding in vision-language models (VLMs) by addressing distributional biases in negative caption data. It introduces two GPT-based strategies for generating negative captions (replacing and swapping) and a text-only data-filtering step that balances the positive and negative caption distributions. The curated data is used to fine-tune leading VLMs (notably LLaVA-1.5 and BLIP2 ITM) to produce an image–text alignment score, termed LLaVA-score, which achieves state-of-the-art results across multiple benchmarks and can rank text-to-image (T2I) generated images by alignment quality. The approach pushes the model to use the image rather than textual cues alone, improves compositional reasoning, and offers practical utility for data curation and evaluation in multimodal systems, with potential applicability to other modalities.

Abstract

In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the alignment model does not depend solely on textual information but also considers the associated images for predicting alignment accurately. By creating this enhanced training data, we fine-tune an existing leading visual-language model to boost its capability in understanding alignment. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment. Project page: https://yuheng-li.github.io/LLaVA-score/
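
The text-only filtering idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: this page does not specify their classifier, so a tiny unigram naive Bayes model stands in for it, and the function names, toy captions, and the 0.7 threshold are all hypothetical. The point is only the mechanism: a negative caption that a model can flag without seeing the image is "easy" and is dropped, so the remaining negatives cannot be separated from positives by text alone.

```python
import math
from collections import Counter


def train_text_only_scorer(positives, negatives):
    """Fit a toy unigram naive Bayes model and return P(negative | caption text).

    Stand-in for a text-only classifier (an assumption for illustration);
    it never looks at any image.
    """
    pos_counts = Counter(w for c in positives for w in c.lower().split())
    neg_counts = Counter(w for c in negatives for w in c.lower().split())
    vocab = set(pos_counts) | set(neg_counts)
    # Add-one smoothing over the shared vocabulary.
    pos_total = sum(pos_counts.values()) + len(vocab)
    neg_total = sum(neg_counts.values()) + len(vocab)
    log_prior = math.log(len(negatives) / len(positives))

    def prob_negative(caption):
        log_odds = log_prior
        for w in caption.lower().split():
            if w not in vocab:
                continue  # unseen words carry no evidence in this toy model
            log_odds += math.log((neg_counts[w] + 1) / neg_total)
            log_odds -= math.log((pos_counts[w] + 1) / pos_total)
        return 1.0 / (1.0 + math.exp(-log_odds))

    return prob_negative


def filter_easy_negatives(negatives, prob_negative, threshold=0.7):
    """Keep only negatives the text-only model cannot confidently flag."""
    return [c for c in negatives if prob_negative(c) < threshold]


# Hypothetical toy captions (not taken from the paper's data).
positives = ["a dog on a sofa", "a cat on a bed", "a man riding a horse"]
negatives = [
    "a sofa on a dog",          # swap: same words as a positive caption
    "a cat on a purple cloud",  # replacement: introduces unusual words
    "a horse riding a man",     # swap
]

scorer = train_text_only_scorer(positives, negatives)
kept = filter_easy_negatives(negatives, scorer, threshold=0.7)
```

In this toy run the replacement caption is filtered out because its unusual words give it away, while the two swapped captions are kept. That outcome is partly an artifact of the stand-in: a unigram model ignores word order and so cannot detect swaps at all, whereas a stronger text-only classifier could also flag implausible swapped captions.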

Paper Structure

This paper contains 18 sections, 4 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: Left: Qualitative examples for image-text alignment prediction, where our approach can distinguish fundamental concepts such as positioning, counting, and attributes. Right: Our approach shows superior performance in image-text alignment.
  • Figure 2: Top: We feed the positive caption (green dots) into GPT to create two types of negative captions (red dots): substituting one linguistic element with any plausible alternative or swapping the positions of two components. The blue part in negative captions highlights the modifications. Bottom: we remove easy negative samples using only text data and utilize the remaining samples to fine-tune vision-language models.
  • Figure 3: Prompts used in GPT for generating two types of negative captions, with in-context examples shown in purple.
  • Figure 4: Top prediction based on a text-only binary classifier. Top left: Negative captions generated through replacement strategy. Top right: Negative captions generated through swapping strategy. Bottom: Positive captions from the COCO dataset.
  • Figure 5: Our curated test datasets feature captions paired with one positive image and one negative image each. All the positive images are displayed on the left side.
  • ...and 4 more figures