AV-Unified: A Unified Framework for Audio-visual Scene Understanding

Guangyao Li, Xin Wang, Wenwu Zhu

TL;DR

AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets.

Abstract

When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current tasks such as event localization, parsing, segmentation, and question answering are mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model's adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS, and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.
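
The abstract says that task inputs and outputs are converted into sequences of discrete tokens, but it does not spell out the scheme. As a rough illustration of what such a serialization could look like for one temporal task (event localization), the sketch below quantizes segment boundaries into time-bin tokens and appends a class token. The vocabulary layout, `NUM_BINS`, `CLIP_SECONDS`, the label set, and all function names are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# All constants below are illustrative assumptions, not values from the paper.
NUM_BINS = 100                               # number of time-bin steps
CLIP_SECONDS = 10.0                          # assumed clip length in seconds
CLASSES = ["dog_bark", "violin", "speech"]   # assumed label set

# Token ids 0..NUM_BINS are time bins; class tokens follow them.
CLASS_OFFSET = NUM_BINS + 1

Event = Tuple[float, float, str]  # (start_seconds, end_seconds, label)


def time_to_token(t: float) -> int:
    """Quantize a timestamp in seconds to a time-bin token id."""
    t = min(max(t, 0.0), CLIP_SECONDS)  # clamp into the clip
    return round(t / CLIP_SECONDS * NUM_BINS)


def token_to_time(tok: int) -> float:
    """Map a time-bin token id back to seconds."""
    return tok / NUM_BINS * CLIP_SECONDS


def encode_events(events: List[Event]) -> List[int]:
    """Serialize events as a flat [start, end, class, start, end, class, ...] sequence."""
    tokens: List[int] = []
    for start, end, label in sorted(events):
        tokens += [
            time_to_token(start),
            time_to_token(end),
            CLASS_OFFSET + CLASSES.index(label),
        ]
    return tokens


def decode_events(tokens: List[int]) -> List[Event]:
    """Invert encode_events, up to quantization error."""
    events: List[Event] = []
    for i in range(0, len(tokens) - 2, 3):
        s, e, c = tokens[i:i + 3]
        events.append((token_to_time(s), token_to_time(e), CLASSES[c - CLASS_OFFSET]))
    return events
```

Under this kind of scheme, tasks with different output types (segments, labels, answers) could share one output vocabulary, which is the property the abstract attributes to the unified token representation.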

Paper Structure (19 sections, 9 equations, 4 figures, 7 tables)

This paper contains 19 sections, 9 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: AV-Unified is a single sequence-to-sequence model that performs a variety of audio-visual tasks using a unified architecture, without the need for task-specific or modality-specific branches. The schematic shows the model on several representative audio-visual tasks: event localization, video parsing, sound source localization, segmentation, and question answering.
  • Figure 2: The proposed Multi-scale Temporal-Spatial Perception Framework. First, the visual and audio features extracted by the encoder are fed into a temporal perception module to capture key audio-visual temporal cues. Then, a spatial perception module performs cross-modal guidance and interaction based on these temporal cues, uncovering spatial associations between the audio and visual modalities. Next, carefully designed task-specific textual prompts guide the model to focus on the features most relevant to the current task. Finally, the learned representations are serialized and passed to task-specific decoders to address different downstream audio-visual scene understanding tasks.
  • Figure 3: Encoder and decoder for the AVS task.
  • Figure 4: Visualization of the Temporal-Spatial Task (AVQA). The audio-visual representation of the input video is first processed by the AV-Unified framework to obtain a unified representation, corresponding to the w/o task-specific prompt setting. Then, a task-specific prompt is applied to guide the model in selecting the most relevant features required for the task, corresponding to the w/ task-specific prompt setting.
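
The first two stages described in the Figure 2 caption (temporal perception at multiple scales, then audio-guided spatial perception) can be illustrated with a small sketch. This is only a toy instance under stated assumptions: the window sizes, the use of mean pooling, and the scaled dot-product softmax are stand-ins chosen for illustration, and the function names are hypothetical. The paper's actual modules may differ substantially.

```python
import math
from typing import List, Sequence

Vector = List[float]


def multi_scale_temporal(features: List[Vector],
                         scales: Sequence[int] = (1, 2, 4)) -> List[Vector]:
    """For each time step, mean-pool features over windows of several
    lengths and concatenate the pooled vectors across scales.

    Assumption: mean pooling over local windows stands in for the
    multi-scale temporal perception module.
    """
    num_steps = len(features)
    fused_steps: List[Vector] = []
    for t in range(num_steps):
        fused: Vector = []
        for s in scales:
            lo = max(0, t - s // 2)        # window start, clipped at 0
            hi = min(num_steps, lo + s)    # window end, clipped at the clip end
            window = features[lo:hi]
            dim = len(window[0])
            fused += [sum(v[d] for v in window) / len(window) for d in range(dim)]
        fused_steps.append(fused)
    return fused_steps


def audio_guided_attention(audio: Vector, patches: List[Vector]) -> List[float]:
    """Softmax over scaled dot products between one audio vector and each
    visual patch vector, giving one weight per spatial location.

    Assumption: dot-product attention stands in for cross-modal guidance.
    """
    scale = math.sqrt(len(audio))
    scores = [sum(a * p for a, p in zip(audio, patch)) / scale for patch in patches]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In this toy version, patches whose features align with the audio vector receive larger weights, which is one simple way to let audio indicate where in the frame to look when no spatial audio labels are available.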