FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

Jie Ma; Zhitao Gao; Qi Chai; Jun Liu; Pinghui Wang; Jing Tao; Zhou Su

TL;DR

This work tackles robustness in Audio-Visual Question Answering (AVQA) by identifying dataset biases and proposing a comprehensive debiasing framework. It introduces FortisAVQA, a rephrased and distribution-shifted test dataset derived from MUSIC-AVQA to diagnose head/tail performance, and MAVEN, a generative AVQA model equipped with Multifaceted Cycle Collaborative Debiasing (MCCD) to mitigate unimodal and multimodal biases. The key contributions are the FortisAVQA dataset, the MAVEN architecture with KL-based divergence and cycle-consistency objectives, and extensive ablations demonstrating the role of each component and the plug-and-play nature across baselines. The results show state-of-the-art performance on FortisAVQA and improved robustness on both FortisAVQA and MUSIC-AVQA, highlighting practical gains for reliable multimodal reasoning in the presence of dataset biases.

Abstract

Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task that requires intelligent systems to accurately answer natural language queries based on paired audio-video inputs. However, existing AVQA approaches often overfit to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.
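To make the idea of evaluating over frequent, rare, and overall distributions concrete, the following is a minimal sketch, not the procedure used to build FortisAVQA. It assumes each test sample carries a question type and a ground-truth answer, and it labels an answer as "head" (frequent) when its count within its question type exceeds a hypothetical threshold of 1.2 times the mean answer count for that type; everything else is "tail" (rare). The function names, the `qtype` and `answer` field names, and the 1.2 ratio are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def split_head_tail(samples, ratio=1.2):
    """Label each sample 'head' or 'tail' by answer frequency within its question type.

    `samples` is a list of dicts with the (assumed) keys 'qtype' and 'answer'.
    `ratio` is a hypothetical threshold, not a value taken from the paper.
    """
    # Count how often each answer occurs per question type.
    counts = defaultdict(Counter)
    for s in samples:
        counts[s["qtype"]][s["answer"]] += 1

    labels = []
    for s in samples:
        per_type = counts[s["qtype"]]
        mean_count = sum(per_type.values()) / len(per_type)
        is_head = per_type[s["answer"]] > ratio * mean_count
        labels.append("head" if is_head else "tail")
    return labels


def accuracy_by_split(samples, predictions, ratio=1.2):
    """Return accuracy on the head, tail, and overall subsets."""
    labels = split_head_tail(samples, ratio)
    hits = defaultdict(list)
    for sample, pred, label in zip(samples, predictions, labels):
        correct = float(pred == sample["answer"])
        hits[label].append(correct)
        hits["overall"].append(correct)
    return {name: sum(v) / len(v) for name, v in hits.items()}
```

Under these assumptions, a model that always predicts the most frequent answer for a question type would score well on the head subset and poorly on the tail subset, which is the kind of gap that a distribution-shifted test set is meant to expose.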
