Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering

Jie Ma, Min Hu, Pinghui Wang, Wangchun Sun, Lingyun Song, Hongbin Pei, Jun Liu, Youtian Du

TL;DR

This work tackles biases in Audio-Visual Question Answering (AVQA) by introducing MUSIC-AVQA-R, a rephrasing- and distribution-shift-based robustness benchmark that enables head-tail (in-distribution vs out-of-distribution) evaluation across audio, visual, and AVQA tasks. It also proposes Multifaceted Cycle Collaborative Debiasing (MCCD), a plug-and-play architecture that uses uni-modal bias learners and cross-modal logit regularization through losses $\mathcal{L}_{\mathrm{d}}$, $\mathcal{L}_{\mathrm{c}}$, and $\mathcal{L}_{\mathrm{a}}$ to mitigate bias learning. Across MUSIC-AVQA and MUSIC-AVQA-R, MCCD yields state-of-the-art performance, with substantial gains on MUSIC-AVQA-R (notably $9.32\%$) and robust improvements across head and tail samples, as demonstrated by extensive ablations and qualitative analyses. The work highlights the limited robustness of existing multi-modal QA methods and provides a practical, plug-and-play debiasing framework that can guide more reliable AVQA systems in real-world scenarios.

Abstract

Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task that requires intelligent systems to respond accurately to natural language queries based on audio-video input pairs. However, prevalent AVQA approaches are prone to overlearning dataset biases, which results in poor robustness, and current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, we first propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA), and then introducing distribution shifts to split the questions. The former yields a large, diverse test space, while the latter enables a comprehensive robustness evaluation on rare, frequent, and overall questions. Second, we propose a robust architecture that uses a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on MUSIC-AVQA-R, notably obtaining a significant improvement of 9.32%. Extensive ablation experiments on both datasets analyze the effectiveness of each component of the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through evaluation on our dataset. We also combine various baselines with our proposed strategy on both datasets to verify its plug-and-play capability. Our dataset and code are available at https://github.com/reml-group/MUSIC-AVQA-R.

Paper Structure (29 sections, 7 equations, 12 figures, 6 tables, 1 algorithm)

Figures (12)

  • Figure 1: Questions in current AVQA datasets are generated from a limited set of predefined templates, which may not reflect real-world scenarios. Our findings indicate that existing methods [yun2021pano; lin2023vision], such as STG [li2022learning], are not robust. This may be attributed to excessive bias learning, such as memorizing statistical regularities between critical question words and answers.
  • Figure 2: Distribution of rephrasing questions based on the first three words.
  • Figure 3: Statistics visualization for MUSIC-AVQA-R. $\mu(a)$ is the average number of answers in a group. The dark color on the right denotes the number of head samples, while the light-colored area denotes that of tail samples.
  • Figure 4: Robust AVQA architecture to overcome bias learning. Our MCCD strategy is plug-and-play, allowing seamless integration with other AVQA methods.
  • Figure 5: Sensitivity and qualitative analysis. $\alpha$ and $\beta$ are the weight-controlling factors in the MCCD strategy. We visualize attention weights on the uniformly sampled audio and video frames.
  • ...and 7 more figures
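
The head/tail split mentioned in the Figure 3 caption can be illustrated with a minimal sketch. It rests on two assumptions that this page does not confirm: that $\mu(a)$ is the mean number of samples per distinct answer within a question group, and that an answer counts as "head" when its frequency exceeds `factor * mu` (the `factor` parameter and the function name `split_head_tail` are hypothetical, chosen only for illustration). The paper's actual splitting rule may differ.

```python
from collections import Counter
from statistics import mean


def split_head_tail(answers, factor=1.0):
    """Split the answers of one question group into head and tail sets.

    `answers` holds one answer label per sample in the group.
    Assumption: mu(a) is the mean sample count per distinct answer, and
    an answer is "head" when its count exceeds factor * mu(a).
    """
    counts = Counter(answers)          # samples per distinct answer
    mu = mean(counts.values())         # mu(a) for this group
    head = {a for a, c in counts.items() if c > factor * mu}
    tail = set(counts) - head          # every remaining answer is tail
    return mu, head, tail


# Toy group: 12 samples spread over 4 distinct answers.
group = ["two"] * 6 + ["one"] * 3 + ["three"] * 2 + ["four"] * 1
mu, head, tail = split_head_tail(group)
print(mu, sorted(head), sorted(tail))
```

In this toy group the frequent answer lands in the head set and the rarer answers in the tail set, which is the kind of in-distribution versus out-of-distribution partition the benchmark evaluates separately.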