FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo
TL;DR
FineMatch introduces an aspect-based benchmark for detecting and correcting fine-grained image-text mismatches across four aspect classes: Entities, Attributes, Relations, and Numbers. It defines two tasks, MD and MD&C (mismatch detection, and mismatch detection with correction), and proposes ITM-IoU, a metric that couples exact-class matching with lexical and semantic similarity to ground-truth triplets. The benchmark comprises a diverse 49,906-pair dataset constructed from GPT-synthesized, retrieved, and diffusion-generated data. The construction pipeline combines GPT-4V-guided aspect-graph parsing, node replacement, data debiasing, and human annotation to produce high-quality mismatches, supporting both supervised fine-tuning and in-context learning experiments across state-of-the-art VLMs. Empirical results show that models trained on FineMatch improve at fine-grained mismatch detection and correction, while strong multimodal in-context learners still lag on this specialized task. An AutoAlign system demonstrates practical hallucination detection and corrective editing for text-to-image (T2I) generation. Overall, FineMatch advances reliable image-text alignment and offers a practical route to reducing hallucinations in T2I systems through end-to-end detection and correction of compositional discrepancies.
Abstract
Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite their impressive ability to perform complex reasoning, current VLMs often struggle to capture compositional information precisely on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark that focuses on text and image mismatch detection and correction. The benchmark introduces a novel task for improving and evaluating the compositionality of VLMs in aspect-based fine-grained text and image matching. Given an image-text pair that may contain between 0 and 3 mismatches, models are required to identify the mismatched aspect phrases within the caption, determine each aspect's class, and propose corrections. To evaluate model performance on this task, we propose a new evaluation metric, ITM-IoU, which our experiments show correlates highly with human evaluation. We also provide a comprehensive experimental analysis of existing mainstream VLMs under both fully supervised learning and in-context learning settings. We find that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models with strong multimodal in-context learning abilities (e.g., GPT-4V, Gemini Pro Vision) are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.
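To make the task format concrete, the following is a minimal illustrative sketch of how a prediction consisting of (mismatched phrase, aspect class, correction) triplets could be scored against ground truth in an IoU-like fashion. It is not the paper's definition of ITM-IoU: the names `Mismatch`, `pair_score`, and `iou_style_score`, the equal weighting of phrase and correction similarity, the greedy matching, and the 0.5 threshold are all assumptions made for illustration. Only a lexical similarity (Python's `difflib`) is used here as a stand-in; the semantic similarity component mentioned above is omitted.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# The four aspect classes named in the text.
ASPECT_CLASSES = {"Entity", "Attribute", "Relation", "Number"}


@dataclass(frozen=True)
class Mismatch:
    """Hypothetical container for one mismatch triplet."""
    phrase: str      # mismatched aspect phrase found in the caption
    aspect: str      # one of ASPECT_CLASSES
    correction: str  # proposed replacement phrase


def _lexical_sim(a: str, b: str) -> float:
    # Character-level similarity in [0, 1]; a stand-in for the paper's
    # lexical and semantic similarity terms (assumption).
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def pair_score(pred: Mismatch, gold: Mismatch) -> float:
    # The aspect class must match exactly; otherwise the pair scores zero.
    if pred.aspect != gold.aspect:
        return 0.0
    # Equal weighting of phrase and correction similarity (assumption).
    return 0.5 * (_lexical_sim(pred.phrase, gold.phrase)
                  + _lexical_sim(pred.correction, gold.correction))


def iou_style_score(preds, golds, threshold: float = 0.5) -> float:
    # A pair with no mismatches that is predicted as having none is perfect.
    if not preds and not golds:
        return 1.0
    unmatched = list(golds)
    matched_total, matches = 0.0, 0
    for p in preds:
        # Greedily pair each prediction with its best remaining gold triplet.
        best = max(unmatched, key=lambda g: pair_score(p, g), default=None)
        if best is not None and pair_score(p, best) >= threshold:
            matched_total += pair_score(p, best)
            matches += 1
            unmatched.remove(best)
    # Soft intersection divided by the size of the union of both sets.
    union = len(preds) + len(golds) - matches
    return matched_total / union
```

Under these assumptions, a prediction that reproduces every ground-truth triplet exactly scores 1.0, while a prediction with the right phrase but the wrong aspect class contributes nothing to the intersection and enlarges the union, which mirrors the exact-class requirement described above.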
