AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku

TL;DR

AlignBench introduces a large-scale, fine-grained benchmark for image-text alignment built from synthetic captions generated by diverse captioners and text-to-image models, with sentence-level correctness and hallucination-type annotations. It evaluates decoder-based vision-language models and detectors, revealing that CLIP-like models remain nearly blind to subtle misalignments, that detectors over-score early sentences, and that detectors exhibit self-preference. The study demonstrates strong correlations between AlignBench performance and broader alignment benchmarks while highlighting unique challenges in real-time localization of hallucinated segments. Overall, AlignBench provides a robust, scalable framework to diagnose and improve image-text alignment capabilities across a wide range of models and domains.

Abstract

Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
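
To make the sentence-level setup concrete, the following is a minimal sketch of how a detector could be scored against per-sentence correctness labels. The metric (AUROC), the function name `auroc`, and the toy labels and scores are assumptions chosen for illustration; the abstract does not specify the metric the authors use.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a correct sentence outscores an incorrect one.

    labels: 1 for a sentence annotated as correct, 0 for incorrect.
    scores: the detector's correctness score for each sentence.
    Tied scores count as half a win.
    """
    correct = [s for y, s in zip(labels, scores) if y == 1]
    incorrect = [s for y, s in zip(labels, scores) if y == 0]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect sentence")

    wins = 0.0
    for c in correct:
        for i in incorrect:
            if c > i:
                wins += 1.0
            elif c == i:
                wins += 0.5
    return wins / (len(correct) * len(incorrect))


# Hypothetical toy data: five sentences, two of them hallucinated.
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_auroc = auroc(toy_labels, toy_scores)
```

A value near 0.5 would indicate a detector that cannot separate correct from hallucinated sentences, which is one way to read the "nearly blind" finding.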

Paper Structure

This paper contains 24 sections, 26 figures, 14 tables.

Figures (26)

  • Figure 1: We introduce a novel benchmark, AlignBench, which evaluates the VLM's ability for text-image alignment. We employ state-of-the-art Image-to-Text and Text-to-Image models to create synthetic image-caption pairs with or without subtle hallucinations. Misaligned words are highlighted in red. Using this dataset, we benchmark diverse VLMs to assess their ability to understand the alignment of image-sentence pairs. We find that subtle hallucinations generated by multimodal models can be hard to detect, even by state-of-the-art VLMs.
  • Figure 2: AlignBench spans diverse Image2Text (i.e., Captioner) and Text2Image models, diverse image domains, and provides high-quality annotations enriched with hallucination-type labels for deep analysis. The rightmost panel presents an example annotation. We first annotate sentence-level correctness, then annotate each hallucinated segment and its type label.
  • Figure 3: Examples of hallucinated sentences in AlignBench. The hallucinated portions are often subtle, requiring fine-grained image-text alignment ability to detect them.
  • Figure 4: Left: Ratio of incorrect sentences by position; all captioners make fewer errors at the first position. Different colors indicate different positions. Right: Number of unaligned sentences per category; most mistakes occur in attributes and text.
  • Figure 5: Examples of incorrect sentences with detectors’ correctness scores. Higher scores indicate greater confidence in correctness. Detectors are prone to being overconfident in these examples. We highlight detectors’ errors in red within the text and mark the grounded incorrect regions in the image with orange boxes.
  • ...and 21 more figures
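
The per-position error ratio described in the Figure 4 caption can be sketched as follows. This is an illustrative computation only: the input layout (one list of per-sentence correctness flags per caption), the function name `incorrect_ratio_by_position`, and the toy data are assumptions, not the authors' code or data.

```python
from collections import defaultdict
from typing import Dict, List


def incorrect_ratio_by_position(captions: List[List[bool]]) -> Dict[int, float]:
    """Fraction of incorrect sentences at each sentence position.

    captions: one entry per caption; each entry lists, in order,
    whether each sentence was annotated as correct (True) or not (False).
    Positions are 0-indexed; captions may have different lengths.
    """
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for sentence_flags in captions:
        for position, is_correct in enumerate(sentence_flags):
            totals[position] += 1
            if not is_correct:
                errors[position] += 1
    return {pos: errors[pos] / totals[pos] for pos in sorted(totals)}


# Hypothetical annotations for three captions of different lengths.
toy_captions = [
    [True, False, True],
    [True, True],
    [False, False, False],
]
toy_ratios = incorrect_ratio_by_position(toy_captions)
```

Grouping the same counts by captioner would give one curve per model, matching the per-captioner comparison the caption describes.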