FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

Jie Ma, Zhitao Gao, Qi Chai, Jun Liu, Pinghui Wang, Jing Tao, Zhou Su

TL;DR

This work tackles robustness in Audio-Visual Question Answering (AVQA) by identifying dataset biases and proposing a comprehensive debiasing framework. It introduces FortisAVQA, a rephrased and distribution-shifted test dataset derived from MUSIC-AVQA to diagnose head/tail performance, and MAVEN, a generative AVQA model equipped with Multifaceted Cycle Collaborative Debiasing (MCCD) to mitigate unimodal and multimodal biases. The key contributions are the FortisAVQA dataset, the MAVEN architecture with KL-based divergence and cycle-consistency objectives, and extensive ablations demonstrating the role of each component and the plug-and-play nature across baselines. The results show state-of-the-art performance on FortisAVQA and improved robustness on both FortisAVQA and MUSIC-AVQA, highlighting practical gains for reliable multimodal reasoning in the presence of dataset biases.

Abstract

Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task that requires intelligent systems to accurately answer natural language queries based on paired audio-video inputs. However, existing AVQA approaches often overfit to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.
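
To make the rare/frequent evaluation concrete, the following is a minimal sketch of a head/tail split by answer frequency. It is not the paper's exact procedure. It assumes that samples are grouped by a question-type key, that $\mu(a)$ is the mean number of samples per answer class within a group, and that an answer counts as frequent (head) when its count exceeds $k \cdot \mu(a)$. The function name `split_head_tail`, the `(group, answer)` sample layout, the threshold rule, and the value `k=1.2` are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def split_head_tail(samples, k=1.2):
    """Split (group, answer) samples into head and tail lists.

    Assumed rule (hypothetical, for illustration only): within each group,
    an answer is 'head' if its sample count exceeds k times the mean
    sample count per answer class; otherwise it is 'tail'.
    """
    # Count how often each answer occurs inside each question group.
    counts = defaultdict(Counter)
    for group, answer in samples:
        counts[group][answer] += 1

    head, tail = [], []
    for group, answer in samples:
        group_counts = counts[group]
        # Mean number of samples per answer class in this group.
        mu = sum(group_counts.values()) / len(group_counts)
        if group_counts[answer] > k * mu:
            head.append((group, answer))
        else:
            tail.append((group, answer))
    return head, tail


# Toy data: one group with a dominant answer and two rare answers.
samples = (
    [("count", "two")] * 6
    + [("count", "three")] * 2
    + [("count", "five")] * 1
)
head, tail = split_head_tail(samples, k=1.2)
```

In this toy group the mean count per answer is 3, so with `k=1.2` only the answer seen six times lands in the head set. Reporting accuracy separately on `head`, `tail`, and their union corresponds to the frequent, rare, and overall evaluation settings described in the abstract.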

Paper Structure

This paper contains 19 sections, 9 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: An example illustrating how existing AVQA datasets are constructed, with a comparison between STG and MAVEN. Questions in current AVQA datasets are generated from a limited set of predefined templates, which may not reflect real-world scenarios. Our findings indicate that existing methods such as STG [li2022learning] are not robust, which may be attributed to excessive bias learning, such as memorizing statistical regularities between critical question words and answers.
  • Figure 2: Rephrasing visualization of FortisAVQA. The left panel showcases a rephrasing example in FortisAVQA, while the middle and right panels depict the question distribution of FortisAVQA and MUSIC-AVQA, respectively, based on their first three words.
  • Figure 3: Statistics visualization for the AVQA task in FortisAVQA. $\mu(a)$ is the average number of answers in a group. The previous split mechanism, published at NeurIPS 2024 [ma2024look], assigns all classes to the tail in subfigure (a). In contrast, the newly proposed approach offers greater flexibility by adapting to the data distribution. $k$ is the ratio in Equation \ref{constraint}. In subfigure (f), the dark color denotes the number of head samples, while the light-colored area denotes the number of tail samples. In the distribution comparison, we observe a high similarity between the train and head sets, whereas the train and tail sets exhibit a significant distributional difference. This demonstrates that FortisAVQA can evaluate the robustness of multimodal reasoning more comprehensively and precisely than existing datasets such as MUSIC-AVQA.
  • Figure 4: Illustration of our proposed Multimodal Audio-Visual Epistemic Network (MAVEN). The instructions fall into two categories: multimodal fusion and modality-specific bias learning. MFAG denotes multimodal fusion and answer generation. During the test stage, the module marked with dashed lines is removed. Separate icons in the figure denote parameter sharing, frozen parameters, and trainable parameters.
  • Figure 5: Sensitivity analysis of discrepancy enlargement and cycle constraint in Equations \ref{eq:dist} and \ref{eq:cyc}.
  • ...and 1 more figure