Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

Tianle Chen, Chaitanya Chakka, Arjun Reddy Akula, Xavier Thomas, Deepti Ghadiyaram

TL;DR

MLLMs often falter when modalities conflict: they over-rely on text and fail to reason robustly across modalities. The authors introduce MMA-Bench to systematically probe modality-specific misalignment, and use black-box and white-box analyses to reveal text dominance and brittle integration. They propose a modality-aware fine-tuning strategy using LoRA that teaches models to prioritize the correct modality, yielding substantial gains in cross-modal grounding and improved zero-shot performance on AVHBench. The work combines a rigorous dataset pipeline, interpretability methods, and a practical tuning approach to move toward more reliable cross-modal reasoning in MLLMs.

Abstract

Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench, comprising videos and tasks that probe a model's reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-source MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, and thus lack robust multimodal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.

Paper Structure

This paper contains 40 sections, 2 equations, 25 figures, 11 tables.

Figures (25)

  • Figure 1: We propose MMA-Bench to expose how MLLMs behave when sight, sound, and language conflict. Each example presents a controlled conflict between modalities (e.g., audio, video, or text) and asks two modality-specific questions: one about the video and one about the audio. Correct answers differ across modalities, forcing the model to attend to the reliable modality. These structured contradictions reveal whether MLLMs are truly multimodal or take shortcuts during cross-modal reasoning tasks.
  • Figure 2: Automated data curation pipeline for building MMA-Bench. Our two-stage pipeline converts raw AudioSet (Gemmeke et al., 2017) into clean, semantically aligned audio-video samples. Stage 1 simplifies the ontology by pruning action-less classes, ambiguous classes (e.g., the audio event "hiss" could be associated with "stream" or "cat"), and restricted classes (e.g., "heart murmur"). Stage 2 retains videos based on the simplified audio events: only clips where the audible event is clearly produced by a visible object are kept, yielding a high-quality subset that is further post-processed (see the MMA-Bench section).
  • Figure 3: Verification of audio-visual semantic alignment. After automated pruning (Fig. 2), each clip undergoes 4 simple yes/no consistency checks. A sample is kept only if it passes all checks.
  • Figure 4: Unimodal probing under visual and auditory ablation. Each subplot reports classification accuracy under visual-focused (left bars) and audio-focused (right bars) prompts. "Audio removed" replaces the soundtrack with silence, while "Frames zeroed" replaces all video frames with black images.
  • Figure 5: Visual- and audio-focused prompts used for understanding the modality sensitivity of MLLMs. Each prompt encourages the model to focus on one modality and allows a controlled comparison of visual and auditory reasoning within MLLMs.
  • ...and 20 more figures
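
The ablations named in the Figure 4 caption ("Audio removed", "Frames zeroed") and the keep-only-if-all-checks-pass rule from the Figure 3 caption can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the array layouts (frames as `(T, H, W, C)` uint8, audio as a 1-D float waveform), and the toy clip are all assumptions made here for the example.

```python
from typing import Sequence

import numpy as np


def remove_audio(waveform: np.ndarray) -> np.ndarray:
    """'Audio removed': return silence with the same length and dtype."""
    return np.zeros_like(waveform)


def zero_frames(frames: np.ndarray) -> np.ndarray:
    """'Frames zeroed': return all-black frames with the same shape and dtype."""
    return np.zeros_like(frames)


def keep_sample(check_answers: Sequence[bool]) -> bool:
    """Keep a clip only if every yes/no consistency check passed.

    An empty list of checks is treated as a failure (an assumption).
    """
    return len(check_answers) > 0 and all(check_answers)


# Hypothetical toy clip: 8 RGB frames of 32x32 pixels and 1 s of 16 kHz audio.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
waveform = rng.standard_normal(16000).astype(np.float32)

silent = remove_audio(waveform)   # same shape, all zeros
black = zero_frames(frames)       # same shape, all zeros

# A clip with one failed check out of four is discarded.
kept = keep_sample([True, True, True, True])
dropped = keep_sample([True, True, False, True])
```

Both ablations return new arrays rather than modifying the inputs, so the same clip can be evaluated in its original and ablated forms under the visual-focused and audio-focused prompts.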