CLASH: A Benchmark for Cross-Modal Contradiction Detection
Teodora Popordanoska, Jiameng Li, Matthew B. Blaschko
TL;DR
CLASH is a large-scale cross-modal contradiction benchmark that pairs MS COCO images with captions containing controlled contradictions, testing whether models spontaneously detect conflicts arising from object-level and attribute-level changes. The dataset supports both multiple-choice and open-ended questions, with approximately 15k training samples and a human-verified diagnostic test set, and is built with a multi-stage generation and validation pipeline. Evaluation reveals a sizable gap between top closed-source models and open-source models in detecting cross-modal conflicts, uncovers systematic modality biases, and shows that targeted LoRA finetuning on high-quality CLASH data can substantially improve conflict detection. The work argues that dual-ground-truth evaluation is essential for robust multimodal systems and offers a path toward more reliable AI through modality-robust reasoning.
Abstract
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection, a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with captions that contain controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
