CLASH: A Benchmark for Cross-Modal Contradiction Detection

Teodora Popordanoska, Jiameng Li, Matthew B. Blaschko

TL;DR

CLASH introduces a large-scale cross-modal contradiction benchmark that pairs MS COCO images with captions containing controlled object-level or attribute-level contradictions, testing whether models spontaneously detect such conflicts. The dataset supports both multiple-choice and open-ended questions, with approximately 15k training samples and a human-verified diagnostic test set, and is built with a multi-stage generation and validation pipeline. Evaluation reveals a sizable gap between top closed-source and open-source models in detecting cross-modal conflicts, uncovers systematic modality biases, and demonstrates that targeted LoRA fine-tuning on high-quality CLASH data can substantially improve conflict detection. The work argues that dual-ground-truth evaluation is essential for robust multimodal systems and provides a path toward more reliable AI through modality-robust reasoning.

Abstract

Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection, a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
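
To make the evaluation setup described above concrete, the following is a minimal sketch of how one sample and a multiple-choice accuracy score might be represented. It is not the authors' code or data schema: the class `ClashSample`, all of its field names, the option letters, the example image ids, and the function `accuracy_by_change_type` are hypothetical, and the sketch assumes that each multiple-choice question has exactly one option that flags the image-caption contradiction.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClashSample:
    # Hypothetical schema; the released dataset may use different fields.
    image_id: int         # MS COCO image identifier (made-up values below)
    caption: str          # caption edited so that it contradicts the image
    change_type: str      # "object" or "attribute"
    question: str         # targeted question about the changed element
    options: dict         # multiple-choice options, letter -> text
    conflict_option: str  # assumed: the letter that flags the contradiction


def accuracy_by_change_type(samples, predictions):
    """Per change type, the fraction of predictions that pick the conflict option."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, predicted in zip(samples, predictions):
        total[sample.change_type] += 1
        if predicted.strip().upper() == sample.conflict_option:
            correct[sample.change_type] += 1
    return {kind: correct[kind] / total[kind] for kind in total}


# Two illustrative samples, loosely modeled on the snowboarder/skier and
# orange/blue edits mentioned in the figure captions. Options are invented.
samples = [
    ClashSample(1, "A skier rides down the slope.", "object",
                "Who is riding down the slope?",
                {"A": "A skier", "B": "A snowboarder",
                 "C": "The image and caption disagree"}, "C"),
    ClashSample(2, "A blue kite flies over the beach.", "attribute",
                "What color is the kite?",
                {"A": "Blue", "B": "Orange",
                 "C": "The image and caption disagree"}, "C"),
]

scores = accuracy_by_change_type(samples, ["C", "a"])
print(scores)
```

Here the first prediction flags the conflict and the second follows the caption, so the object score is 1.0 and the attribute score is 0.0. Splitting scores by change type mirrors the category-level breakdown the paper reports, though the actual metric definitions may differ.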

Paper Structure

This paper contains 48 sections, 11 figures, 16 tables.

Figures (11)

  • Figure 1: Left: A three-stage pipeline generates conflicting image–text pairs from MS COCO with targeted questions. Right: Examples in CLASH, showing object and attribute contradictions. Models are evaluated on the ability to detect conflicts in multiple-choice or open-ended format.
  • Figure 2: Diagnostic set statistics. Left: Object category distribution (655 samples). Middle: Attribute distribution (634 samples). Right: Distribution of questions by their first three words.
  • Figure 3: Dataset generation pipeline. Starting from MS COCO captions, the pipeline identifies change type (object vs. attribute) and applies corresponding validation checks. Validated changes proceed to question generation before human verification determines final dataset acceptance. Examples show object change (snowboarder $\rightarrow$ skier) and attribute change (orange $\rightarrow$ blue).
  • Figure 4: Word frequency analysis. Left: Most frequent words appearing in object contradiction pairs. Right: Most common words from attribute contradiction pairs.
  • Figure 5: Category-specific performance on multiple-choice task. We show contradiction detection accuracy across object categories (left) and attribute categories (right) for four representative models.
  • ...and 6 more figures