Answering Diverse Questions via Text Attached with Key Audio-Visual Clues

Qilang Ye; Zitong Yu; Xin Liu

TL;DR

This work tackles audio-visual question answering (AVQA) by mitigating redundancy and heterogeneity in audio-visual features through Mutual Correlation Distillation (MCD). It introduces a Mutual Correlation Module that generates combinatorial question embeddings via soft audio-visual clue mining, complemented by a Semantic Approximation Module that aligns modalities in a shared latent space using contrastive learning and distillation. The approach removes the heavy dependence on decision-level fusion, improving generalization across diverse questions on the Music-AVQA and AVQA datasets, with extensive ablations confirming the contribution of each component. The results suggest that question-guided, clue-based reasoning with cross-modal alignment can enhance multimodal QA performance in realistic settings while reducing overfitting.

Abstract

Audio-visual question answering (AVQA) requires referring to both the video content and the auditory information, then correlating them with the question to predict the most precise answer. Although mining deeper layers of audio-visual information to interact with questions facilitates the multimodal fusion process, the redundancy of audio-visual parameters tends to reduce the generalization of the inference engine to multiple question-answer pairs in a single video. Moreover, the naturally heterogeneous relationship between audio-visual content and text makes perfect fusion challenging. To prevent high-level audio-visual semantics from weakening the network's adaptability to diverse question types, we propose a framework that performs mutual correlation distillation (MCD) to aid question inference. MCD is divided into three main steps: 1) First, a residual structure is used to enhance the audio-visual soft associations based on self-attention; key local audio-visual features relevant to the question context are then captured hierarchically by shared aggregators and coupled, in the form of clues, with specific question vectors. 2) Second, knowledge distillation is enforced to align audio-visual-text pairs in a shared latent space, narrowing the cross-modal semantic gap. 3) Finally, the audio-visual dependencies are decoupled by discarding the decision-level integrations. We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs, i.e., Music-AVQA and AVQA. Experiments show that our method outperforms other state-of-the-art methods; one interesting finding is that removing deep audio-visual features during inference can effectively mitigate overfitting. The source code is released at http://github.com/rikeilong/MCD-forAVQA.
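As a rough illustration of the second step (aligning pairs in a shared latent space), the sketch below computes a generic InfoNCE-style contrastive loss in plain Python. It is not taken from the paper: the function names (`l2_normalize`, `info_nce`), the temperature of 0.1, the toy 2-D vectors, and the choice of InfoNCE itself are assumptions made for illustration, and the authors' actual distillation objective may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(question_embs, clue_embs, temperature=0.1):
    """Mean InfoNCE loss; question i is the positive pair of clue i.

    Hypothetical helper: every other clue in the batch acts as a negative.
    """
    q = [l2_normalize(v) for v in question_embs]
    c = [l2_normalize(v) for v in clue_embs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, cj) / temperature for cj in c]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy embeddings (assumed values): two questions and their audio-visual clues.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each clue points the same way as its question
swapped = [[0.1, 0.9], [0.9, 0.1]]   # clues paired with the wrong question

loss_aligned = info_nce(questions, aligned)
loss_swapped = info_nce(questions, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each question embedding sits closest to its own clue and large when the pairing is swapped, which is the behaviour a contrastive alignment objective is meant to reward.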

Paper Structure (32 sections, 14 equations, 6 figures, 10 tables)

Introduction
Related Works
Audio-Visual Question Answering
Video Question Answering Datasets
Video Question Answering Methods
Transformer-based Cross-attention
Proposed Method
Basic Encoders
Audio-Visual-Text Inputs
Audio-Visual-Text Embeddings
Mutual Correlation Module
Association Block
Aggregator
Semantic Approximation Module
Audio-Visual Knowledge Distillation
...and 17 more sections

Figures (6)

Figure 1: An illustration of our proposed mutual correlation distillation guidance. MCM denotes the mutual correlation module. Because the same question text may occur for multiple videos, we first guide the question with audio-visual information so that identical questions can be distinguished across samples, and then apply knowledge distillation via contrastive learning. (a) is a simplified generation process of the combinatorial question embedding. (b) is our proposal to alleviate semantic ambiguity across modalities by pulling positive pairs together and pushing negative pairs apart in a contrastive way.
Figure 2: An overview of our MCD, where the dotted lines are the main contributions. The features $\bm{v}$, $\bm{a}$, and $\bm{q}$ are first obtained by the basic encoders. To focus on the fine-grained semantic information in the sentences, we also combine the keywords with the question types to form text objects. The audio-visual embeddings $f_{\bm{v}}$ and $f_{\bm{a}}$ then go through enhanced self-attention in $N$ association blocks to produce the advanced audio-visual embeddings $\hat{f_{\bm{v}}}$ and $\hat{f_{\bm{a}}}$ (here we learn the audio-visual contrasts at the attention layer to emphasize coordination). These are fed into the aggregator along with the question features $f_{\bm{q}}$ to generate the combinatorial question embeddings $\hat{f_{\bm{q}}}$. Note that we add an optional branch to fuse the audio-visual embeddings with the text objects. So that the combinatorial question embeddings can further learn the audio-visual knowledge, we propose to distill the knowledge in a shared latent space. Finally, the combinatorial question embeddings are used to infer answers.
Figure 3: An overview of the Mutual Correlation Module (MCM). (a) proposes a Transformer-based cross-attention mechanism. (b) shows the constituent elements of the text object. (c) demonstrates the multimodal correlation process, where the early correlation and late correlation ensure that both shallow independent audio-visual features and fine-grained audio-visual interaction features are learned in a balanced way. (d) demonstrates the internal structure of the aggregator. Specifically, the input audio-visual embeddings first go through soft association, which attaches a small amount of information about each other to yield a high-level audio-visual embedding. The low-level and high-level audio-visual embeddings are then each fed, together with the question features, into the shared aggregator, which additionally takes the text objects as input and finally outputs the combinatorial question embeddings.
Figure 4: Our proposed method of testing multimodal fusion. (a) shows the abandonment of decision-level fusion; (b) shows the typical concatenation and element-wise addition process.
Figure 5: An illustration of the impact of multimodal late fusion on inference results.
...and 1 more figures
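The captions above describe a cross-attention step in which question features pick up audio-visual clues. A minimal sketch of that general idea follows, using single-head scaled dot-product attention with a residual connection in plain Python. It is only an assumption-laden illustration: the names `cross_attend` and `attach_clues`, the absence of learned projection matrices, and the toy feature values are all hypothetical and are not the paper's aggregator.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attended context vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


def attach_clues(question, segment_feats):
    """Hypothetical helper: the question attends over per-segment audio-visual
    features, and the attended clue is added back residually."""
    context, weights = cross_attend(question, segment_feats, segment_feats)
    combined = [q + c for q, c in zip(question, context)]
    return combined, weights


# Toy data (assumed values): one question vector and three video segments.
question = [1.0, 0.0]
segments = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]

combined, weights = attach_clues(question, segments)
print(weights)
```

In this toy run the first segment, which is aligned with the question vector, receives most of the attention weight, so the combined embedding is dominated by the clue from that segment.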
