Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Brian Gordon, Yonatan Bitton, Yonatan Shafir, Roopal Garg, Xi Chen, Dani Lischinski, Daniel Cohen-Or, Idan Szpektor

TL;DR

This work addresses the lack of actionable explanations in image–text alignment by introducing a feedback-centric paradigm that jointly predicts alignment and provides textual as well as visual misalignment explanations. It introduces ConGen-Feedback to generate a large TV-Feedback training set and SeeTRUE-Feedback as a human-annotated benchmark, enabling end-to-end training and evaluation of feedback-capable vision–language models. Fine-tuning PaLI models on TV-Feedback yields state-of-the-art performance across binary alignment, textual feedback, and visual localization, with strong generalization to out-of-distribution prompts and models. The results highlight the practical value of explicit misalignment feedback for improving text-to-image generation, dataset annotation quality, and image captioning, and open avenues for richer, feedback-driven VLM development.

Abstract

While existing image-text alignment models reach high-quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanations of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible misaligned captions for a given image, together with corresponding textual explanations and visual indicators. We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision-language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines on both the binary alignment classification and the explanation generation tasks. Our method code and human-curated test set are available at: https://mismatch-quest.github.io/

Paper Structure

This paper contains 25 sections, 20 figures, and 4 tables.

Figures (20)

  • Figure 1: Our alignment model steps: (1) the model predicts the alignment label between the input image/text pairs; (2) for misalignment labels, it then generates textual and visual feedback.
  • Figure 2: Qualitative analysis of out-of-distribution results, showcasing image-text pairs generated by the text-to-image models Stable-Diffusion XL [podell2023sdxl_stable_difussion], Stable-Diffusion 2.1 [rombach2021highresolution], Adobe Firefly [Adobe_Firefly], and Composable Diffusion [composable_diffusion] (credits to [attend_and_excite]), alongside the corresponding textual and visual feedback predicted by the PaLI-X model fine-tuned on TV-Feedback.
  • Figure 3: The ConGen-Feedback data generation method: Top image shows a synthetic image from PickaPic with a predicted caption; Bottom image is a natural image from COCO with its longest available caption. Both undergo LLM processing to generate contradictions, feedback, textual misalignment labels, and visual misalignment labels, followed by visual bounding box generation.
  • Figure 4: The Amazon Mechanical Turk annotation interface for SeeTRUE-Feedback, which asks whether each of the feedback, the misalignment in the text, and the misalignment in the image is correct.
  • Figure 5: Metric results on SeeTRUE-Feedback, showing how scores are computed from the input, the ground truth, and the predictions of the fine-tuned PaLI model, with NLI entailment scores computed using BART NLI. The first row shows a high-scoring success example, while the second row shows a low-scoring failure with incorrect feedback and predictions.
  • ...and 15 more figures
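
The two steps in the Figure 1 caption (predict the alignment label, then generate textual and visual feedback only for misaligned pairs) can be sketched as a small control flow. This is a minimal illustration, not the authors' implementation: the `Feedback` container, the `assess` function, and the two callables `predict_alignment` and `generate_feedback` are hypothetical names introduced here, and the bounding-box format is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

# Hypothetical box format, assumed for illustration: (x_min, y_min, x_max, y_max).
BBox = Tuple[float, float, float, float]


@dataclass
class Feedback:
    """Hypothetical container for the outputs described in Figure 1."""
    aligned: bool
    textual_feedback: Optional[str] = None  # explanation of the mismatch
    text_span: Optional[str] = None         # misaligned words in the caption
    image_box: Optional[BBox] = None        # misaligned region in the image


def assess(
    image: Any,
    caption: str,
    predict_alignment: Callable[[Any, str], bool],
    generate_feedback: Callable[[Any, str], Tuple[str, str, BBox]],
) -> Feedback:
    """Step 1: binary alignment. Step 2: feedback for misaligned pairs only."""
    if predict_alignment(image, caption):
        # Aligned pairs need no explanation.
        return Feedback(aligned=True)
    explanation, span, box = generate_feedback(image, caption)
    return Feedback(False, explanation, span, box)


# Usage with stub callables standing in for a fine-tuned model.
result = assess(
    image=None,
    caption="a red car",
    predict_alignment=lambda img, cap: False,
    generate_feedback=lambda img, cap: (
        "The car is blue, not red", "red", (0.1, 0.2, 0.5, 0.6)
    ),
)
```

Passing the two model calls as functions keeps the sketch independent of any particular vision-language model; in practice both could be prompts to the same fine-tuned model.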