When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning

Chenyu Zhang, Minsol Kim, Shohreh Ghorbani, Jingyao Wu, Rosalind Picard, Patricia Maes, Paul Pu Liang

TL;DR

The paper tackles the opacity of multimodal reasoning in large language models by introducing modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence. It presents a model-agnostic, agent-based fusion framework that treats each modality as an independent decision-maker and collects per-modality votes and self-assessments to audit the fusion. The authors evaluate on the MER, MELD, and IEMOCAP emotion recognition benchmarks, revealing dataset- and backbone-dependent reliability profiles and showing, via Top-$k$ analysis, that some of the uncertainty is recoverable. The framework offers a practical scaffold for auditing and debugging multimodal reasoning and can guide calibration and intervention strategies to mitigate failures.

Abstract

Despite rapid growth in multimodal large language models (MLLMs), their reasoning traces remain opaque: it is often unclear which modality drives a prediction, how conflicts are resolved, or when one stream dominates. In this paper, we introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. To analyze such dynamics, we propose a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead). Applying our diagnostic layer in a case study on multimodal emotion recognition benchmarks with foundation models revealed systematic reliability profiles, providing insight into whether failures may arise from dataset artifacts or model limitations. More broadly, our framework offers a diagnostic scaffold for multimodal reasoning, supporting principled auditing of fusion dynamics and informing possible interventions.
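
To make the agent-and-fusion idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each modality agent returns a single label with a self-assessed confidence in [0, 100], that fusion simply sums confidence per label, and that a modality counts as a contributor when its label matches the gold label and as a saboteur otherwise. The names `AgentVote`, `fuse`, and `attribute` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentVote:
    modality: str      # e.g. "T", "A", "V", or "TAV"
    label: str         # candidate label proposed by this agent
    confidence: float  # self-assessed confidence, assumed to lie in [0, 100]


def fuse(votes):
    """Rank labels by summed confidence (an assumed, deliberately simple rule)."""
    scores = defaultdict(float)
    for vote in votes:
        scores[vote.label] += vote.confidence
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda label: (-scores[label], label))


def attribute(votes, gold):
    """Split modalities into contributors (agree with gold) and saboteurs."""
    contributors = [v.modality for v in votes if v.label == gold]
    saboteurs = [v.modality for v in votes if v.label != gold]
    return contributors, saboteurs


# Toy example: a confident but wrong audio agent outweighs two correct agents.
votes = [
    AgentVote("T", "neutral", 40.0),
    AgentVote("A", "angry", 95.0),
    AgentVote("V", "neutral", 35.0),
]
ranking = fuse(votes)
contributors, saboteurs = attribute(votes, gold="neutral")
print(ranking, contributors, saboteurs)
```

In this toy case the fused Top-1 is wrong, yet the gold label still appears at rank 2, which is the kind of recoverable uncertainty a Top-$k$ view would surface.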

Paper Structure

This paper contains 12 sections, 2 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Each modality agent (T, A, V) and a joint-view agent (TAV) outputs classification labels with confidence scores. A simple fusion step aggregates these into a ranked prediction, enabling attribution of contributors vs. saboteurs. The callout highlights high-confidence unimodal errors that mislead the fused decision (modality sabotage); see the method section for details.
  • Figure 2: Left heatmap: unimodal accuracy for Text (T), Audio (A), Vision (V), and the joint view (TAV), highlighting differences across datasets. Right heatmap: proportion of cases in which a modality sabotages the fused decision (a high-confidence error that flips Top-1, at threshold 70); each value is shown as #cases/total (rate%).
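
The sabotage proportion described in the Figure 2 caption could be computed along the following lines. This is a hedged sketch under stated assumptions rather than the paper's exact procedure: confidence is assumed to be on a 0 to 100 scale, fusion is assumed to sum confidence per label, and "flipping Top-1" is interpreted as "the fused Top-1 is wrong, but becomes correct once the modality's vote is removed". The function names `top1`, `is_sabotage`, and `sabotage_rate` are hypothetical.

```python
from collections import defaultdict


def top1(votes):
    """Return the label with the highest summed confidence, or None if empty."""
    scores = defaultdict(float)
    for _modality, label, confidence in votes:
        scores[label] += confidence
    if not scores:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    return min(scores, key=lambda label: (-scores[label], label))


def is_sabotage(votes, modality, gold, threshold=70.0):
    """Assumed test: a high-confidence error whose removal fixes the fused Top-1."""
    own = [v for v in votes if v[0] == modality]
    rest = [v for v in votes if v[0] != modality]
    high_conf_error = any(
        label != gold and confidence >= threshold for _m, label, confidence in own
    )
    return high_conf_error and top1(votes) != gold and top1(rest) == gold


def sabotage_rate(samples, modality, threshold=70.0):
    """Format the result as '#cases/total (rate%)', as in the heatmap cells."""
    total = len(samples)
    cases = sum(is_sabotage(v, modality, g, threshold) for v, g in samples)
    rate = 100 * cases / total if total else 0.0
    return f"{cases}/{total} ({rate:.1f}%)"


# Each sample: (list of (modality, label, confidence) votes, gold label).
samples = [
    ([("T", "neutral", 40), ("A", "angry", 95), ("V", "neutral", 35)], "neutral"),
    ([("T", "happy", 80), ("A", "happy", 60), ("V", "sad", 50)], "happy"),
    ([("T", "sad", 30), ("A", "angry", 65), ("V", "sad", 20)], "sad"),
    ([("T", "angry", 50), ("A", "angry", 85), ("V", "sad", 40)], "sad"),
]
audio_rate = sabotage_rate(samples, "A")
print(audio_rate)
```

Only the first toy sample counts as audio sabotage here: the second has a correct audio vote, the third falls below the threshold, and in the fourth removing audio does not restore the correct Top-1.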