Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models
Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, Zhe Zhao
TL;DR
The paper tackles modality interference in multimodal large language models (MLLMs) by formalizing cross-modality competency and diagnosing reliance on irrelevant signals through a causal, perturbation-based framework. It introduces a unified fine-tuning approach that combines perturbation-based data augmentation (heuristic and adversarial), modality-specific masking, and output-level consistency regularization to enforce stable, task-relevant cross-modal reasoning. Empirical results across multiple model families and benchmarks show Pareto-optimal improvements in unimodal robustness and multimodal generalization, including strong out-of-distribution resilience to real-world perturbations. The work advances reliable multimodal reasoning by explicitly constraining cross-modal influences, with broad implications for deploying MLLMs in noisy, real-world environments.
Abstract
Multimodal Large Language Models (MLLMs) demonstrate strong performance on multimodal benchmarks, yet often exhibit poor robustness when exposed to spurious modality interference, such as irrelevant text in vision understanding or irrelevant visual content in question answering. At its core, modality interference refers to cases where spurious signals from non-essential modalities distort model decisions, which we systematically analyze through causal, perturbation-based diagnostic experiments. To address this problem, we propose a unified fine-tuning framework that combines heuristic and adversarial perturbation-based data augmentation with output-level consistency regularization between original and perturbed inputs. Extensive experiments across image-heavy, text-heavy, and multimodal benchmarks, spanning multiple MLLM architectures and model scales, demonstrate consistent improvements in unimodal robustness and generalization, while also improving standard multimodal performance.
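
To make the idea of output-level consistency regularization concrete, the following is a minimal sketch and not the authors' implementation. It assumes that consistency is measured as the KL divergence between the softmax output distributions for the original and perturbed inputs, and that the heuristic perturbation simply appends an unrelated sentence to the question. The abstract specifies neither choice, so the function names (`add_irrelevant_text`, `consistency_loss`, `total_loss`) and the weight `lam` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def add_irrelevant_text(question, distractors, rng):
    """Hypothetical heuristic perturbation: append one unrelated sentence."""
    return question + " " + rng.choice(distractors)


def consistency_loss(logits_original, logits_perturbed):
    """Assumed consistency term: KL between original and perturbed outputs."""
    p = softmax(logits_original)
    q = softmax(logits_perturbed)
    return kl_divergence(p, q)


def total_loss(task_loss_original, task_loss_perturbed,
               logits_original, logits_perturbed, lam=1.0):
    """Assumed combined objective: task losses plus weighted consistency term."""
    reg = consistency_loss(logits_original, logits_perturbed)
    return task_loss_original + task_loss_perturbed + lam * reg


if __name__ == "__main__":
    rng = random.Random(0)
    perturbed = add_irrelevant_text(
        "What animal is shown in the image?",
        ["The stock market closed higher today."],
        rng,
    )
    print(perturbed)
    # Identical outputs incur no penalty; diverging outputs incur a positive one.
    print(consistency_loss([2.0, 0.5, 0.1], [2.0, 0.5, 0.1]))
    print(consistency_loss([2.0, 0.5, 0.1], [0.1, 0.5, 2.0]))
```

In this sketch the regularizer is zero when the perturbation leaves the model's output distribution unchanged and grows as the irrelevant signal shifts the prediction, which is the behavior a consistency term is meant to penalize.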
