Towards Multilingual Audio-Visual Question Answering

Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma

TL;DR

This work tackles multilingual Audio-Visual Question Answering (AVQA) by machine-translating existing AVQA benchmarks into eight languages, yielding the m-MUSIC-AVQA and m-AVQA datasets. It introduces the MERA framework, which builds on the frozen foundation models VideoMAE, AST, and mBERT, and benchmarks multilingual AVQA with three model variants (MERA-L, MERA-C, MERA-T) and a weighted ensemble of them. The results indicate that MERA-C often achieves the best per-language performance, while the ensemble provides robust gains across languages and question types, offering a practical, scalable baseline for multilingual AVQA. Overall, the study provides datasets and baselines to spur future multilingual AVQA research and applications in diverse linguistic contexts.

Abstract

In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly focused on English, and replicating it for other languages requires substantial resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets covering eight languages, created from existing benchmark AVQA datasets. This avoids the extra human annotation effort of collecting questions and answers manually. We further propose the MERA framework, which leverages state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future work in multilingual AVQA.

Paper Structure (11 sections, 1 equation, 2 figures, 3 tables)


Figures (2)

  • Figure 1: The proposed MERA framework. MERA takes video, audio, and text (the question) as input and outputs the answer. The foundation models VideoMAE, AST, and mBERT are kept frozen. The suite comprises three models: MERA-L (i), MERA-C (ii), and MERA-T (iii). The Ensemble Module represents the weighted ensemble of the three models.
  • Figure 2: Multilingual MUSIC-AVQA (m-MUSIC-AVQA) dataset in eight languages. From top to bottom: English (en), French (fr), Hindi (hi), German (de), Spanish (es), Italian (it), Dutch (nl), and Portuguese (pt)
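
The weighted ensemble named in the Figure 1 caption can be illustrated with a minimal sketch. This summary does not specify how the weights are chosen or how the three outputs are combined, so the following assumes a normalized weighted sum of per-model answer probabilities. The function names, the toy probabilities, and the weights are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def weighted_ensemble(
    probs_per_model: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Combine per-model answer distributions with a normalized weighted sum.

    Assumption: each model emits a probability for every candidate answer,
    and all models share the same answer vocabulary.
    """
    if len(probs_per_model) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    n_answers = len(probs_per_model[0])
    combined = [0.0] * n_answers
    for probs, weight in zip(probs_per_model, weights):
        if len(probs) != n_answers:
            raise ValueError("all models must score the same answer set")
        for i, p in enumerate(probs):
            # Dividing by the total keeps the result a valid distribution.
            combined[i] += (weight / total) * p
    return combined


def predict(combined: Sequence[float]) -> int:
    """Return the index of the highest-scoring answer."""
    return max(range(len(combined)), key=lambda i: combined[i])


# Toy distributions over three candidate answers (illustrative values only).
mera_l = [0.6, 0.3, 0.1]
mera_c = [0.2, 0.7, 0.1]
mera_t = [0.3, 0.5, 0.2]

# Hypothetical weights; in practice they might be tuned on a validation set.
weights = [1.0, 2.0, 1.0]

combined = weighted_ensemble([mera_l, mera_c, mera_t], weights)
best_answer = predict(combined)
```

With these toy inputs the second model counts twice as much as the other two, so the ensemble can overrule a single model that prefers a different answer.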