Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

Guangyao Li, Henghui Du, Di Hu

TL;DR

A Temporal-Spatial Perception Model (TSPM) is proposed to perceive the key visual and auditory cues related to a question. It constructs declarative sentence prompts derived from the question template, which help the temporal perception module identify the critical segments relevant to that question.

Abstract

The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template, to assist the temporal perception module in better identifying critical segments relevant to the questions. Subsequently, a spatial perception module is designed to merge visual tokens from selected segments to highlight key latent targets, followed by cross-modal interaction with audio to perceive potential sound-aware areas. Finally, the significant temporal-spatial cues from these modules are integrated to answer the question. Extensive experiments on multiple AVQA benchmarks demonstrate that our framework excels not only in understanding audio-visual scenes but also in answering complex questions effectively. Code is available at https://github.com/GeWu-Lab/TSPM.
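The temporal selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TEMPLATES` table, the helper names `to_declarative` and `select_top_k_segments`, and the toy two-dimensional feature vectors are all hypothetical. In the paper, the prompt and segment features would come from a pretrained visual-language model, and the scoring is assumed here to be plain cosine similarity.

```python
import math

# Hypothetical mapping from question templates to declarative prompts.
# The actual templates and prompts used by TSPM may differ.
TEMPLATES = {
    "Which <Object> makes the sound first?": "The <Object> makes the sound first.",
    "Is the <Object> always playing?": "The <Object> is always playing.",
}


def to_declarative(template, **slots):
    """Look up the declarative prompt for a question template and fill its slots."""
    prompt = TEMPLATES[template]
    for name, value in slots.items():
        prompt = prompt.replace(f"<{name}>", value)
    return prompt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_k_segments(prompt_feat, segment_feats, k):
    """Return the indices of the k segments most similar to the prompt.

    Indices are returned in temporal order so that later modules
    still see the selected segments as a sequence.
    """
    scores = [cosine(prompt_feat, seg) for seg in segment_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage: four segments, and the prompt feature points along the first axis.
prompt = to_declarative("Is the <Object> always playing?", Object="violin")
segments = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
selected = select_top_k_segments([1.0, 0.0], segments, k=2)
```

The point of the declarative form is that a statement such as "The violin is always playing." sits closer to the caption-like text that visual-language models are typically trained on than the original question does, so its similarity to each segment is a more reliable ranking signal.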

Paper Structure (18 sections, 6 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Identifying key temporal segments and spatial sound-aware areas is critical for fine-grained audio-visual scene understanding through human-like cognitive processes. For instance, in a violin and flute ensemble scene with a given complex question: a) using the question directly makes it difficult to select key temporal segments effectively; b) the lack of spatial supervision signals makes it hard to capture audio-visual associations; c) our method, which employs constructed declarative prompts, can accurately locate critical temporal segments and spatial cues.
  • Figure 2: Our proposed Temporal-Spatial Perception Model (TSPM) framework. First, the video is divided into $T$ segments, and we use a pre-trained model to extract audio, visual, and question features. Then, a temporal perception module, guided by a constructed prompt, selects the $Top_k$ temporal segments most relevant to the question. Subsequently, the spatial perception module enhances spatial awareness through the interaction of audio and visual tokens.
  • Figure 3: Spatial Perception Module. Similar tokens are merged. For example, in a given complex scene, the tokens covering the man playing the flute are merged into a single token, and the tokens covering the woman playing the violin are merged into another. Following this, the proposed model identifies the sounding instrument and thus infers the correct answer to the input question.
  • Figure 4: Visualized TSPM results. In the showcased examples, we compare our proposed TSPM with the recent AVQA method PSTP-Net [pstpnet2023li]. TSPM progressively selects relevant temporal segments and locates potential sound-aware areas, and thus provides correct answers to the given questions. This process illustrates TSPM's spatiotemporal perception capabilities in complex audio-visual scenarios.
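
The spatial perception step outlined in the captions for Figures 2 and 3 (merge similar visual tokens, then let audio interact with the merged tokens) can be sketched as below. This is only an illustrative approximation under stated assumptions: the greedy threshold-based merging, the scaled dot-product attention, and the names `merge_similar_tokens` and `audio_attention` are hypothetical stand-ins rather than the authors' actual modules, and the two-dimensional vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily merge tokens: each token joins the first group whose mean
    is at least `threshold` similar to it, otherwise it starts a new group.
    Returns one averaged token per group. (Assumed simplification.)"""
    groups = []  # each group keeps a running sum and a member count
    for tok in tokens:
        for group in groups:
            mean = [s / group["count"] for s in group["sum"]]
            if cosine(tok, mean) >= threshold:
                group["sum"] = [s + t for s, t in zip(group["sum"], tok)]
                group["count"] += 1
                break
        else:
            groups.append({"sum": list(tok), "count": 1})
    return [[s / g["count"] for s in g["sum"]] for g in groups]


def audio_attention(audio_feat, tokens):
    """Softmax attention weights of one audio feature over the visual tokens,
    using scaled dot-product scores."""
    scale = math.sqrt(len(audio_feat))
    logits = [sum(a * t for a, t in zip(audio_feat, tok)) / scale for tok in tokens]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: two patches resemble a "flute player", two a "violin player".
patches = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.05, 0.99]]
merged = merge_similar_tokens(patches)
# An audio feature aligned with the second axis should favour the second token.
weights = audio_attention([0.0, 1.0], merged)
```

In this toy case the four patches collapse into two merged tokens, and the audio feature places more weight on the token it is aligned with, which mirrors the idea of first grouping latent targets and then identifying which one is sounding.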