Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, Zhe Zhao

TL;DR

The paper tackles modality interference in multimodal large language models by formalizing cross-modality competency and diagnosing reliance on irrelevant signals through a causal, perturbation-based framework. It introduces a unified fine-tuning approach that combines perturbation-based data augmentation (heuristic and adversarial), modality-specific masking, and output-level consistency regularization to enforce stable, task-relevant cross-modal reasoning. Empirical results across multiple model families and benchmarks show Pareto-optimal improvements in unimodal robustness and multimodal generalization, including strong out-of-distribution resilience to real-world perturbations. The work advances reliable multimodal reasoning by explicitly constraining cross-modal influences, with broad implications for deploying MLLMs in noisy, real-world environments.

Abstract

Multimodal Large Language Models demonstrate strong performance on multimodal benchmarks, yet often exhibit poor robustness when exposed to spurious modality interference, such as irrelevant text in vision understanding, or irrelevant visual content in question answering. At its core, modality interference refers to cases where spurious signals from non-essential modalities distort model decisions, which we systematically analyze through causal, perturbation-based diagnostic experiments. To address this problem, we propose a unified finetuning framework that combines heuristic and adversarial perturbation-based data augmentation with output-level consistency regularization between original and perturbed inputs. Extensive experiments across image-heavy, text-heavy, and multimodal benchmarks, spanning multiple MLLM architectures and model scales, demonstrate consistent improvements in unimodal robustness and generalization, while improving standard multimodal performance.
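The output-level consistency regularization mentioned above can be illustrated with a minimal sketch. This section does not give the paper's exact loss, so everything here is an assumption for illustration: the KL divergence as the consistency term, the weight `lam`, and the helper names `softmax`, `kl_divergence`, `cross_entropy`, and `consistency_loss` are all hypothetical. It is written in plain Python over answer-option logits rather than against any specific training library.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the correct answer index."""
    return -math.log(probs[label] + eps)

def consistency_loss(logits_orig, logits_pert, label, lam=1.0):
    """Task loss on both inputs plus a consistency penalty (assumed form).

    logits_orig: model outputs for the original input
    logits_pert: model outputs for the same input with an irrelevant
                 perturbation in the non-essential modality
    lam:         hypothetical weight on the consistency term
    """
    p = softmax(logits_orig)
    q = softmax(logits_pert)
    task = cross_entropy(p, label) + cross_entropy(q, label)
    return task + lam * kl_divergence(p, q)

# Illustrative logits only, not results from the paper.
stable = consistency_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], label=0)
shifted = consistency_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0], label=0)
print(stable < shifted)
```

When the perturbed input leaves the prediction unchanged, the consistency term vanishes and only the task loss remains; when an irrelevant signal shifts the prediction, both terms grow, which is the behavior the regularizer is meant to penalize.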

Paper Structure

This paper contains 35 sections, 12 equations, 6 figures, 17 tables.

Figures (6)

  • Figure 1: Modality Interference in MLLMs. We visualize the performance of 15 MLLMs using radar charts, where the polygon area signifies model capability. Left (Vision Tasks): When vision tasks are interfered with by Misleading Text (red), models exhibit a severe performance collapse, shrinking towards the center. Right (Text Tasks): Conversely, text reasoning tasks suffer from Visual Noise (brown/irrelevant image), causing a noticeable degradation compared to the no-interference baseline.
  • Figure 2: Causal graph illustrating modality interference in our perturbation-based evaluation analysis. Controlled interventions (heuristic) perturb either the image or text inputs, affecting their intermediate representations and ultimately the model prediction.
  • Figure 3: Overview of our proposed framework.
  • Figure 4: Performance degradation under irrelevant perturbations reveals modality interference in MLLMs. Left: Mini-ImageNet (image-heavy) with Original input, Unrelated Facts, and Misleading Descriptions. Right: OpenBookQA (text-heavy) with Random Pixels, Full Black Canvas, and Irrelevant Real Images. Misleading descriptions most severely affect image-heavy tasks, while irrelevant real images cause the largest drop in text-heavy reasoning.
  • Figure 5: Task-wise robustness under perturbation. Each radar chart shows model accuracy (%) across Mini-ImageNet and Caltech-101 (image-heavy) and OpenBookQA and MMLU (text-heavy) under various perturbations. (a) shows the raw accuracy of different pretrained MLLMs. (b–d) show the normalized relative accuracy of each MLLM, defined as the absolute tested accuracy divided by the accuracy of the vanilla MLLM in the original setting without perturbation.
  • ...and 1 more figures
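
The relative-accuracy normalization described in the Figure 5 caption can be sketched as follows. The function name `relative_accuracy` and the numbers in the usage lines are hypothetical and serve only to show the arithmetic; they are not results from the paper.

```python
def relative_accuracy(tested_acc, vanilla_original_acc):
    """Normalize an absolute accuracy by the vanilla model's accuracy
    in the original (unperturbed) setting, as the caption describes."""
    if vanilla_original_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return tested_acc / vanilla_original_acc

# Hypothetical accuracies (%), chosen only to illustrate the ratio.
vanilla_original = 90.0
under_perturbation = 45.0
print(relative_accuracy(under_perturbation, vanilla_original))  # 0.5
```

A value of 1.0 means the perturbed setting matches the vanilla model's unperturbed accuracy, and values below 1.0 indicate degradation relative to that baseline.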