Omni-Modal Dissonance Benchmark: Systematically Breaking Modality Consensus to Probe Robustness and Calibrated Abstention

Zabir Al Nazi, Shubhashis Roy Dipta, Md Rizwan Parvez

Abstract

Existing omni-modal benchmarks attempt to measure modality-specific contributions, but their measurements are confounded: naturally co-occurring modalities carry correlated yet unequal information, making it unclear whether results reflect true modality reliance or information asymmetry. We introduce OMD-Bench, in which all modalities are initially congruent: each presents the same anchor, an object or event independently perceivable through video, audio, and text. We then systematically corrupt subsets of modalities to isolate each modality's contribution. We also evaluate calibrated abstention: whether models appropriately refrain from answering when evidence is conflicting. The benchmark comprises 4,080 instances spanning 27 anchors across eight corruption conditions. Evaluating ten omni-modal models under zero-shot and chain-of-thought prompting, we find that models over-abstain when two modalities are corrupted yet under-abstain severely when all three are, while maintaining high confidence (~60-100%) even under full corruption. Chain-of-thought prompting improves abstention alignment with human judgment but amplifies overconfidence rather than mitigating it. OMD-Bench thus serves as a diagnostic benchmark for modality reliance, robustness to cross-modal inconsistency, and uncertainty calibration in omni-modal systems.
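
With three modalities, the eight corruption conditions plausibly correspond to the 2^3 subsets of {video, audio, text} that can be swapped for a different anchor's content, with the corruption level $k$ counting how many modalities are corrupted (cf. the $k{=}0$ labels in the figure captions). The minimal Python sketch below illustrates this enumeration only; the function name and the condition representation are assumptions for illustration, not the benchmark's released code.

```python
from itertools import combinations

MODALITIES = ("video", "audio", "text")

def corruption_conditions():
    """Enumerate the 2^3 = 8 ways to corrupt a subset of modalities.

    Each condition is the set of modalities whose content is replaced
    with a different anchor's content; k is the corruption level
    (number of corrupted modalities). Representation is illustrative,
    not the benchmark's official condition labels.
    """
    for k in range(len(MODALITIES) + 1):
        for corrupted in combinations(MODALITIES, k):
            yield {"k": k, "corrupted": set(corrupted)}

if __name__ == "__main__":
    for cond in corruption_conditions():
        print(f"k={cond['k']}: corrupt {sorted(cond['corrupted']) or ['none']}")
```

Running the sketch lists one $k{=}0$ condition (all modalities congruent), three single-modality corruptions, three two-modality corruptions, and one full corruption, matching the eight conditions reported in the abstract.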

Figures (16)

  • Figure 1: Overview of OMD-Bench. All modalities initially depict the same anchor (a dog). We systematically corrupt subsets of modalities, replacing them with a different anchor's content, and evaluate whether models can still answer correctly or appropriately abstain.
  • Figure 2: MCQ vs. open-ended accuracy across corruption levels. Shaded area shows the zero-shot (ZS) performance gap. MCQ values are from Table \ref{tab:accuracy}; open-ended values are averaged across judges and splits.
  • Figure 3: Open-ended accuracy by corruption level, split by judge. Each line is a (model, prompt) configuration averaged across both data splits.
  • Figure 4: Per-condition open-ended accuracy (%) averaged across judges and splits. Rows show each (model, prompt) configuration. White lines separate corruption levels.
  • Figure 5: E1: Wrong factual answer - piano ($k{=}0$, Gemini 2.5, real).
  • ...and 11 more figures