AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku
TL;DR
AlignBench introduces a large-scale, fine-grained benchmark for image-text alignment built from synthetic image-caption pairs produced by diverse captioners and text-to-image models, with sentence-level correctness and hallucination-type annotations. It evaluates CLIP-like models and decoder-based vision-language models as misalignment detectors, revealing that CLIP-like models remain nearly blind to subtle misalignments, that detectors over-score early sentences, and that detectors exhibit self-preference. The study demonstrates strong correlations between AlignBench performance and broader alignment benchmarks while highlighting unique challenges in real-time localization of hallucinated segments. Overall, AlignBench provides a robust, scalable framework to diagnose and improve image-text alignment capabilities across a wide range of models and domains.
Abstract
Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind to fine-grained misalignments; (ii) detectors systematically over-score early sentences; and (iii) detectors show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
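As a rough illustration of how sentence-level correctness labels could be used to assess a detector, the sketch below scores a toy set of sentences with AUROC and averages detector scores by sentence position, which is one simple way to surface the over-scoring of early sentences. The choice of AUROC, the record fields (`position`, `label`, `score`), and the toy values are all assumptions made for illustration; this section does not state the benchmark's actual metric or data format.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Probability that a correct sentence outscores an incorrect one.

    labels: 1 = sentence matches the image, 0 = sentence is misaligned.
    scores: detector's alignment score for each sentence (higher = more aligned).
    Ties between a correct and an incorrect sentence count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect sentence")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_score_by_position(records):
    """Average detector score per sentence index within a caption."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["position"]].append(r["score"])
    return {p: sum(v) / len(v) for p, v in sorted(buckets.items())}


# Hypothetical toy data: two three-sentence captions, flattened to sentences.
records = [
    {"position": 0, "label": 1, "score": 0.9},
    {"position": 1, "label": 0, "score": 0.7},
    {"position": 2, "label": 1, "score": 0.6},
    {"position": 0, "label": 0, "score": 0.8},
    {"position": 1, "label": 1, "score": 0.5},
    {"position": 2, "label": 0, "score": 0.2},
]

labels = [r["label"] for r in records]
scores = [r["score"] for r in records]

print("AUROC:", auroc(labels, scores))
print("Mean score by position:", mean_score_by_position(records))
```

In this toy data the mean score falls as the sentence index grows even though correct and incorrect sentences appear equally often at every position, which is the kind of pattern a position-bias check would look for.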
