When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu, Zitong Yu, Yu Zhou

TL;DR

This work investigates whether Multimodal Large Language Models can discern audio-visual confusion when audio is absent or altered. It introduces AV-ConfuseBench to systematically probe audio-visual asymmetry and proposes RL-CoMM, a two-stage reinforcement-learning framework that uses a Large Audio Language Model as a reference and employs a Step-wise Reasoning Reward and Answer-centered Confidence Optimization to balance audio and visual reasoning. Across audio-visual question answering (AVQA) and audio-visual hallucination (AVH) tasks, RL-CoMM yields 10–30% accuracy gains over the baseline model with limited training data, demonstrating more robust audio-visual understanding and fewer cross-modal hallucinations. The methods offer a practical path toward more reliable AV reasoning in real-world multimedia systems, enabling better handling of asymmetric AV information.

Abstract

Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the corresponding sound of an object in the video, e.g., muting the sounding object and asking MLLMs "Is there a/an muted-object sound?". Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves accuracy by 10–30% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.
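To make the benchmark construction concrete, the following is a minimal sketch of how one confusion sample might be built: silence the audio of the sounding object and pair the clip with a yes/no probe. The function names, the plain-list waveform, and the assumption that the muted span covers all of the object's sound (so the expected answer is "no") are illustrative choices, not details taken from AV-ConfuseBench.

```python
from typing import List, Tuple


def mute_segment(samples: List[float], sample_rate: int,
                 start_s: float, end_s: float) -> List[float]:
    """Return a copy of `samples` with [start_s, end_s) replaced by silence."""
    lo = max(0, int(start_s * sample_rate))
    hi = min(len(samples), int(end_s * sample_rate))
    return [0.0 if lo <= i < hi else s for i, s in enumerate(samples)]


def confusion_question(obj: str) -> str:
    """Build the yes/no probe, choosing 'a' or 'an' from the first letter."""
    article = "an" if obj[:1].lower() in tuple("aeiou") else "a"
    return f"Is there {article} {obj} sound?"


def make_confusion_sample(samples: List[float], sample_rate: int, obj: str,
                          start_s: float, end_s: float) -> Tuple[List[float], str, str]:
    """Mute the object's sound and return (audio, question, expected answer).

    Assumption: [start_s, end_s) covers every moment the object is audible,
    so the correct answer to the probe is "no".
    """
    muted = mute_segment(samples, sample_rate, start_s, end_s)
    return muted, confusion_question(obj), "no"
```

A model driven mainly by the visual stream would tend to answer "yes" here because the object is still on screen, which is the failure mode the benchmark is designed to expose.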

Paper Structure

This paper contains 25 sections, 6 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Examples of MLLMs confronting audio-visual confusion.
  • Figure 2: Framework of RL-CoMM, where LALMs serve as the reference model and Omni-LLMs serve as the policy model. Given audio-visual inputs, we first let the LALM generate the reference reasoning for the audio. The reviewer (Qwen3 Embedding) then verifies the policy model's outputs to compute group advantages via the Step-wise Reasoning Reward function. Notably, we remove the KL penalty during policy gradient optimization because the reference and policy models differ in structure. Furthermore, we introduce an Answer-centered Confidence Optimization to reduce uncertainty in the predicted answer of the policy model.
  • Figure 3: Training dynamics of RL-CoMM with Step-wise Reasoning Optimization. The graph below shows the variation of the AVC reward over global steps; the graph above shows the variation of the ARR reward over global steps.
  • Figure 4: Examples of general audio-visual scenes and our designed audio-visual confusion scenes. Questions are answered in the form of yes or no, where the audio information may be intermittent or blocked out.
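The group-advantage step named in the Figure 2 caption can be illustrated with a minimal sketch. This section does not give the Step-wise Reasoning Reward formula, so the scoring rule below (mean best cosine match between each reference step and the policy's steps) is an assumption, as are all function names; the toy vectors stand in for embeddings that a reviewer model would produce. The advantage normalization is the standard group-relative form (reward minus group mean, divided by group standard deviation), and no KL term appears, in line with the caption.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def stepwise_reward(policy_steps: List[Vector],
                    reference_steps: List[Vector]) -> float:
    """Hypothetical reward: for each reference (audio-only) reasoning step,
    take its best cosine match among the policy's steps, then average."""
    if not policy_steps or not reference_steps:
        return 0.0
    best = [max(cosine(r, p) for p in policy_steps) for r in reference_steps]
    return sum(best) / len(best)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (reward - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy usage: one reference trace and a group of three sampled policy traces.
reference = [[1.0, 0.0], [0.0, 1.0]]
group = [
    [[1.0, 0.0], [0.0, 1.0]],  # matches both reference steps
    [[1.0, 0.0]],              # matches only the first step
    [[-1.0, 0.0]],             # contradicts the first step
]
rewards = [stepwise_reward(steps, reference) for steps in group]
advantages = group_advantages(rewards)
```

Samples whose reasoning steps align more closely with the audio-only reference receive positive advantages, and those that diverge receive negative ones, which is the pressure toward audio-grounded reasoning that the caption describes.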