SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering

Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang

TL;DR

This paper introduces the Source-aware Semantic Representation Network (SaSR-Net), a novel AVQA model that streamlines the fusion of audio and visual information using spatial and temporal attention mechanisms to identify answers in multi-modal scenes.

Abstract

Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual information in videos. A significant challenge is interpreting complex multi-modal scenes, which include both visual objects and sound sources, and connecting them to the given question. In this paper, we introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for AVQA. SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question. It streamlines the fusion of audio and visual information using spatial and temporal attention mechanisms to identify answers in multi-modal scenes. Extensive experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
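
The two ideas named in the abstract, source-wise learnable tokens and question-guided temporal attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the random stand-in features, the additive audio-visual fusion, and the `cross_attention` helper are all assumptions made for illustration, and in a real model the source tokens would be trained parameters rather than random vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row pools the value rows."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
T, D, K = 10, 64, 4  # hypothetical: 10 video segments, 64-dim features, 4 source tokens

audio = rng.normal(size=(T, D))          # stand-in per-segment audio features
visual = rng.normal(size=(T, D))         # stand-in per-segment visual features
question = rng.normal(size=(1, D))       # stand-in question embedding
source_tokens = rng.normal(size=(K, D))  # would be learnable parameters in training

# Each source token gathers evidence for "its" source from both modalities.
av = np.concatenate([audio, visual], axis=0)                 # (2T, D)
source_repr, tok_w = cross_attention(source_tokens, av, av)  # (K, D), (K, 2T)

# Question-guided temporal attention: weight segments by relevance to the question.
fused = audio + visual                                       # (T, D), assumed simple fusion
answer_feat, temporal_w = cross_attention(question, fused, fused)  # (1, D), (1, T)
```

Here `tok_w` shows which audio or visual segments each source token draws on, and `temporal_w` gives one relevance score per video segment for the question.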

Paper Structure

This paper contains 17 sections, 13 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Leveraging semantic representation for AVQA involves: (1) Extracting features of various instrument types based on semantic tokens, (2) Identifying the location of the relevant sounding instruments, and (3) Establishing connections between the extracted semantic features, identified instrument locations, and the crucial parts of the question, guiding the model to answer the question accurately.
  • Figure 2: The architecture of the proposed SaSR-Net.
  • Figure 3: Visualization of Spatial Attention (SA) and Temporal Attention (TA) Blocks. The SA Block heatmaps pinpoint sounding object locations, and the TA Block displays audio-visual feature scores. SA localizes critical visual areas, while TA synchronizes video moments with questions, boosting overall audio-visual comprehension.
  • Figure 4: Comparison of our SaSR-Net and AVST (Li et al., 2022). Our SaSR-Net provides more precise answers to complex questions by effectively integrating semantic information into audio and visual features.
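
The spatial-attention heatmaps described in the Figure 3 caption can be approximated with a short sketch: an audio feature vector is compared against a grid of visual patch features, and the softmax-normalized similarities form a heatmap over locations. This is an assumption-laden illustration rather than the paper's SA Block; the grid size, the feature dimension, the function name `spatial_attention_heatmap`, and the planted "sounding" patch are all hypothetical.

```python
import numpy as np

def spatial_attention_heatmap(audio_vec, patch_feats):
    """Return an (H, W) heatmap of softmax-normalized audio-to-patch similarity."""
    h, w, d = patch_feats.shape
    scores = patch_feats.reshape(h * w, d) @ audio_vec / np.sqrt(d)
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights.reshape(h, w)

rng = np.random.default_rng(0)
H, W, D = 7, 7, 32                    # hypothetical 7x7 patch grid, 32-dim features
patches = rng.normal(size=(H, W, D))  # stand-in visual patch features
audio = rng.normal(size=D)            # stand-in audio feature for one segment

# Plant a patch whose feature is aligned with the audio, mimicking a sounding object.
patches[2, 5] = 3.0 * audio

heatmap = spatial_attention_heatmap(audio, patches)
peak = np.unravel_index(heatmap.argmax(), heatmap.shape)
print("most attended patch:", peak)
```

The heatmap sums to one over the grid, so it can be overlaid on a frame as a relative measure of where the audio-matching content lies.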