MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

Sagarika Banerjee; Tangatar Madi; Advait Swaminathan; Nguyen Dao Minh Anh; Shivank Garg; Kevin Zhu; Vasu Sharma

TL;DR

Models are generally better at confirming a correct image-caption pair than at rejecting an incorrect one, and they achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image than on the converse task.

Abstract

Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and two corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models are generally better at confirming a correct image-caption pair than at rejecting an incorrect one. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image than on the converse task. Overall, the results highlight persistent modality misalignment in current VLMs, underscoring the difficulty of the precise cross-modal grounding required for applications that depend on subtle semantic and visual distinctions.
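The minimal-pair design described in the abstract can be illustrated with a small scoring sketch. This is not the paper's evaluation code: the `PairScores` container, the function names, and the rule that a sample counts as correct only when both of its decisions are right are all assumptions made for illustration, and the alignment scores are hypothetical values that a VLM might assign to each image-caption combination.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class PairScores:
    # scores[i][j] is a hypothetical alignment score of image i with caption j.
    # By convention here, image 0 matches caption 0 and image 1 matches caption 1.
    scores: Tuple[Tuple[float, float], Tuple[float, float]]


def caption_selection_correct(sample: PairScores) -> bool:
    """For each image, the matching caption must outscore the other caption."""
    m = sample.scores
    return m[0][0] > m[0][1] and m[1][1] > m[1][0]


def image_selection_correct(sample: PairScores) -> bool:
    """For each caption, the matching image must outscore the other image."""
    m = sample.scores
    return m[0][0] > m[1][0] and m[1][1] > m[0][1]


def accuracy(samples: Sequence[PairScores],
             check: Callable[[PairScores], bool]) -> float:
    """Fraction of samples for which `check` holds (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return sum(check(s) for s in samples) / len(samples)


# Three made-up samples (assumed scores, not results from the paper).
samples = [
    PairScores(((0.9, 0.2), (0.3, 0.8))),   # both directions correct
    PairScores(((0.9, 0.2), (0.95, 0.8))),  # image 1 is over-scored for caption 0
    PairScores(((0.6, 0.5), (0.7, 0.9))),   # captions separated, images not
]
caption_acc = accuracy(samples, caption_selection_correct)
image_acc = accuracy(samples, image_selection_correct)
```

Under this assumed scoring, a gap between `caption_acc` and `image_acc` is one way to make the asymmetry between the two selection tasks visible; the metrics actually reported in the paper may be defined differently.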

Paper Structure (21 sections, 21 figures, 2 tables)

Introduction
Experiments
MiS and MiC Dataset Curation
Results
Discussion
Conclusion
Limitations
Related Works
Visuo-Linguistic Compositional Reasoning
Safety Benchmarks for Vision-Language Models
Cultural Reasoning in AI
Implementation Details
Dataset Statistics
Prompts
Caption Pair Generation Prompts for MiC
...and 6 more sections

Figures (21)

Figure 1: Curation pipeline for MiS and MiC: LLM-generated caption pairs are verified, used for image generation and editing, and manually refined. The complete generation pipeline is detailed in Appendix \ref{curate}. Example entries from the dataset are shown in Fig. \ref{fig:sample}.
Figure 2: Examples from MiSCHiEF illustrating minimal pairs in MiS and MiC.
Figure 3: Category-wise distribution for MiS and MiC.
...and 16 more figures
