
Good Scores, Bad Data: A Metric for Multimodal Coherence

Vasundra Srinivasan

Abstract

Multimodal AI systems are evaluated by downstream task accuracy, but high accuracy does not mean the underlying data is coherent. A model can score well on Visual Question Answering (VQA) while its inputs contradict each other. We introduce the Multimodal Coherence Score (MCS), a metric that evaluates fusion quality independently of any downstream model. MCS decomposes coherence into four dimensions: identity, spatial, semantic, and decision, with weights learned via Nelder-Mead optimization. We evaluate on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validate on 150 COCO images with no retraining. Across three fusion architectures, MCS discriminates quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071). Perturbation experiments confirm each dimension responds independently to its failure mode with zero cross-talk. MCS is lightweight, requires no human annotation, and tells you not just that something broke, but what broke.
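The composition the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the four dimension scorers and the fitting target are hypothetical stand-ins; only the weighted-sum structure over (identity, spatial, semantic, decision) and the use of Nelder-Mead (here via `scipy.optimize.minimize`) follow the text.

```python
# Minimal sketch of the MCS composition, assuming four per-event dimension
# scores in [0, 1] and an assumed supervision signal for weight fitting.
import numpy as np
from scipy.optimize import minimize


def mcs(dim_scores: np.ndarray, weights: np.ndarray) -> float:
    """Weighted combination of the four dimension scores
    (identity, spatial, semantic, decision)."""
    w = weights / weights.sum()  # keep the weights on the simplex
    return float(np.dot(w, dim_scores))


def fit_weights(dim_scores: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Learn dimension weights with Nelder-Mead, as the abstract describes.

    `dim_scores` is (n_events, 4); `targets` is a hypothetical per-event
    quality label used only to make the optimization concrete.
    """
    def loss(w: np.ndarray) -> float:
        w = np.abs(w)  # constrain weights to be non-negative
        preds = dim_scores @ (w / w.sum())
        return float(np.mean((preds - targets) ** 2))

    res = minimize(loss, x0=np.full(4, 0.25), method="Nelder-Mead")
    w = np.abs(res.x)
    return w / w.sum()
```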


Paper Structure

This paper contains 18 sections, 5 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Visual Genome image #2320169. Left: what a DETR detector sees ("zebra" for each animal). Right: what the dataset says ("animal," "bunch," "stripe," and "zebra is next to zebra" repeated ten times). A VQA system returns the correct answer. The benchmark score looks fine. The data is not.
  • Figure 2: The MCS framework. A multimodal event (left) is evaluated along four independent dimensions (center), each weighted by learned coefficients, producing a single diagnostic score (right). The decomposition identifies which dimension to address.
  • Figure 3: MCS dimension scores across three fusion architectures on Visual Genome (n = 1,000). Contract-enforced fusion leads on identity coherence (IC) and spatial coherence (SpC). Decision coherence (DC) is nearly identical across all three, illustrating that downstream accuracy alone cannot distinguish fusion quality.
  • Figure 4: Left: perturbation impact by type. Only the diagonal shows degradation, confirming zero cross-talk between dimensions. Right: degradation scales linearly with corruption rate. All dimensions exceed the 20% significance threshold at 50% perturbation. (A sketch of this protocol follows the figure list.)
  • Figure 5: Left: learned dimension weights. Semantic coherence (SC) carries 72.2% of the signal, indicating that the text-image gap is the primary failure mode in Visual Genome. Right: VG-learned weights transfer to COCO and Open Images with zero retraining.
  • ...and 1 more figure
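The perturbation protocol behind Figure 4 can be sketched as follows. The scorer and corruption functions are hypothetical placeholders; only the design comes from the caption: corrupt one dimension at a time at a given rate, then measure the score drop on all four dimensions, so zero cross-talk shows up as a near-diagonal impact matrix.

```python
# Sketch of the dimension-wise perturbation experiment (Figure 4).
# `score_dimensions` (event -> 4 scores) and the `corruptions` list
# (one corrupting function per dimension) are assumed interfaces.
import numpy as np


def perturbation_matrix(events, score_dimensions, corruptions, rate=0.5):
    """Return a 4x4 matrix: rows = perturbation type, cols = dimension.

    Entry (i, j) is the drop in dimension j's mean score when
    perturbation i is applied to `rate` of the events.
    """
    baseline = np.mean([score_dimensions(e) for e in events], axis=0)
    impact = np.zeros((len(corruptions), len(baseline)))
    rng = np.random.default_rng(0)
    for i, corrupt in enumerate(corruptions):
        corrupted = [corrupt(e) if rng.random() < rate else e for e in events]
        scores = np.mean([score_dimensions(e) for e in corrupted], axis=0)
        impact[i] = baseline - scores  # zero cross-talk => off-diagonal ~ 0
    return impact
```

Sweeping `rate` over, say, 0.1 to 0.5 reproduces the right panel of Figure 4, where degradation scales linearly with corruption rate.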