Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma, Min Hu, Pinghui Wang, Wangchun Sun, Lingyun Song, Hongbin Pei, Jun Liu, Youtian Du
TL;DR
This work tackles bias in Audio-Visual Question Answering (AVQA) by introducing MUSIC-AVQA-R, a robustness benchmark built by rephrasing test questions and applying distribution shifts, which enables head-tail (in-distribution vs. out-of-distribution) evaluation across audio, visual, and audio-visual QA tasks. It also proposes Multifaceted Cycle Collaborative Debiasing (MCCD), a plug-and-play architecture that uses uni-modal bias learners and cross-modal logit regularization, via the losses $\mathcal{L}_{\mathrm{d}}$, $\mathcal{L}_{\mathrm{c}}$, and $\mathcal{L}_{\mathrm{a}}$, to mitigate bias learning. On MUSIC-AVQA and MUSIC-AVQA-R, MCCD achieves state-of-the-art performance, including a $9.32\%$ improvement on MUSIC-AVQA-R and gains on both head and tail samples, supported by extensive ablations and qualitative analyses. The work also highlights the limited robustness of existing multi-modal QA methods and provides a practical debiasing framework that can guide more reliable AVQA systems in real-world scenarios.
Abstract
Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task that requires intelligent systems to respond accurately to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, we first propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA), and subsequently introducing distribution shifts to split the questions. The former leads to a large, diverse test space, while the latter enables a comprehensive robustness evaluation on rare, frequent, and overall questions. Second, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on MUSIC-AVQA-R, notably obtaining a significant improvement of 9.32%. Extensive ablation experiments are conducted on both datasets to analyze the effectiveness of each component of the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through evaluation on our dataset. We also conduct experiments combining various baselines with our proposed strategy on the two datasets to verify its plug-and-play capability. Our dataset and code are available at https://github.com/reml-group/MUSIC-AVQA-R.
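As an illustration of what evaluation on rare, frequent, and overall questions can look like, the following is a minimal sketch, not the procedure used to build MUSIC-AVQA-R. It assumes each test sample carries a hypothetical `group` key (for example, a question type) and an `answer` key, and it labels an answer as frequent ("head") when its count within the group exceeds an assumed threshold of `ratio` times the group's mean answer count; everything else is treated as rare ("tail"). The function names, keys, and the threshold rule are all assumptions introduced here for illustration.

```python
from collections import Counter, defaultdict


def split_head_tail(samples, ratio=1.2):
    """Label each sample 'head' or 'tail' by answer frequency within its group.

    `samples` is a list of dicts with hypothetical keys 'group' and 'answer'.
    `ratio` is an assumed threshold, not a value taken from the paper.
    """
    # Count how often each answer occurs inside each question group.
    counts = defaultdict(Counter)
    for sample in samples:
        counts[sample["group"]][sample["answer"]] += 1

    labels = []
    for sample in samples:
        group_counts = counts[sample["group"]]
        # Mean number of samples per distinct answer in this group.
        mean_count = sum(group_counts.values()) / len(group_counts)
        is_head = group_counts[sample["answer"]] > ratio * mean_count
        labels.append("head" if is_head else "tail")
    return labels


def accuracy_by_split(labels, predictions, answers):
    """Return accuracy for the 'head', 'tail', and 'overall' subsets."""
    totals, correct = Counter(), Counter()
    for label, predicted, gold in zip(labels, predictions, answers):
        # Every sample contributes to its own split and to the overall score.
        for key in (label, "overall"):
            totals[key] += 1
            correct[key] += int(predicted == gold)
    return {key: correct[key] / totals[key] for key in totals}
```

Under this sketch, a model that always predicts the most frequent answer in a group scores well on the head subset and poorly on the tail subset, which is the kind of shortcut behavior a head-tail split is meant to expose.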
