FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo

TL;DR

FineMatch introduces a novel, aspect-based benchmark for detecting and correcting fine-grained image-text mismatches across four aspect classes: Entities, Attributes, Relations, and Numbers. It defines two tasks, mismatch detection (MD) and mismatch detection and correction (MD&C), and proposes ITM-IoU, a metric that couples exact-class matching with lexical and semantic similarity to ground-truth triplets. The dataset contains 49,906 image-text pairs constructed from GPT-synthesized, retrieved, and diffusion-generated data. The construction pipeline combines GPT-4V-guided aspect-graph parsing, node replacement, data debiasing, and human annotation to produce high-quality mismatches, and it supports both supervised fine-tuning and in-context learning experiments across state-of-the-art VLMs. Empirical results show that models trained on FineMatch improve at fine-grained mismatch detection and correction, while strong multimodal in-context learners still lag on this specialized task. An AutoAlign system demonstrates practical hallucination detection and corrective editing for text-to-image (T2I) generation. Overall, FineMatch advances reliable image-text alignment and offers a practical way to reduce hallucinations in T2I systems through end-to-end detection and correction of compositional discrepancies.

Abstract

Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Although VLMs show an impressive ability to perform complex reasoning, current models often struggle to capture compositional information precisely on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for improving and evaluating VLMs' compositionality in aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine each aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate models' performance on this new task, we propose a new evaluation metric named ITM-IoU, which our experiments show correlates highly with human evaluation. In addition, we provide a comprehensive experimental analysis of existing mainstream VLMs under both fully supervised learning and in-context learning settings. We find that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models with strong multimodal in-context learning abilities (e.g., GPT-4V, Gemini Pro Vision) are less skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.
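
To make the task format concrete, the sketch below represents each mismatch as a triplet (aspect phrase, aspect class, correction) and scores a prediction with a simple IoU-style overlap. It is an illustration only and not the paper's ITM-IoU definition: the names `Mismatch`, `triplet_match`, and `toy_iou`, the 0.8 threshold, and the use of `difflib` character-level similarity in place of the metric's lexical and semantic similarity terms are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# The four aspect classes named in the benchmark description.
ASPECT_CLASSES = {"Entity", "Attribute", "Relation", "Number"}


@dataclass(frozen=True)
class Mismatch:
    """One (phrase, class, correction) triplet. Hypothetical container."""
    phrase: str      # mismatched aspect phrase found in the caption
    aspect: str      # one of ASPECT_CLASSES
    correction: str  # proposed replacement phrase


def _sim(a: str, b: str) -> float:
    # Character-level similarity in [0, 1]; a stand-in (assumption) for the
    # lexical and semantic similarity used by the real metric.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triplet_match(pred: Mismatch, gold: Mismatch, threshold: float = 0.8) -> bool:
    # The aspect class must match exactly; phrase and correction only need
    # to be similar enough. The threshold value is an assumption.
    if pred.aspect != gold.aspect:
        return False
    return (_sim(pred.phrase, gold.phrase) >= threshold
            and _sim(pred.correction, gold.correction) >= threshold)


def toy_iou(preds: list, golds: list, threshold: float = 0.8) -> float:
    # Intersection-over-union between predicted and ground-truth triplets.
    if not preds and not golds:
        return 1.0  # a pair with 0 mismatches, correctly predicted as such
    unmatched = list(golds)
    hits = 0
    for p in preds:
        for g in unmatched:
            if triplet_match(p, g, threshold):
                unmatched.remove(g)  # each gold triplet is matched at most once
                hits += 1
                break
    union = len(preds) + len(golds) - hits
    return hits / union


golds = [Mismatch("two dogs", "Number", "three dogs"),
         Mismatch("red", "Attribute", "blue")]
preds = [Mismatch("two dogs", "Number", "three dogs"),
         Mismatch("red", "Entity", "blue")]  # wrong class, so no credit
score = toy_iou(preds, golds)
```

In this example one of the two predictions matches a ground-truth triplet, so the intersection is 1 and the union is 3. The second prediction gets no credit despite identical text because its aspect class differs, which mirrors the exact-class requirement described above.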

Paper Structure

This paper contains 36 sections, 8 equations, 19 figures, 4 tables, 1 algorithm.

Figures (19)

  • Figure 1: Given a text and image pair, FineMatch enables VLMs to detect the mismatched aspects and the aspect classes in the caption and then give the corresponding corrections.
  • Figure 2: Aspect graph parsing and node replacement for GPT-Synthesized text data.
  • Figure 3: The initial data source distribution (inner circle) and domain distribution (outer circle) for the FineMatch training set (left) and test set (right).
  • Figure 4: Data distribution of varying numbers of mismatched aspects across different data sources in FineMatch.
  • Figure 5: Data distribution of the mismatched aspect classes across the training, validation, and test sets in FineMatch.
  • ...and 14 more figures