RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

Muyi Sun, Yixuan Wang, Hong Wang, Chen Su, Man Zhang, Xingqun Qi, Qi Li, Zhenan Sun

TL;DR

This paper defines a fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims at region-aware, frame-level, high-quality sound source understanding. It also proposes SSUFormer, a Sound-Source Understanding TransFormer benchmark that supports both sound source segmentation and sound region description with a multi-modal input and multi-modal output architecture.

Abstract

Audio-Visual Learning (AVL) is a fundamental task in multi-modality learning and embodied intelligence, and it plays a vital role in scene understanding and interaction. However, previous work has mostly explored downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). To provide more specific scene perception details, we define a new fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims at region-aware, frame-level, high-quality sound source understanding. To support this goal, we construct two corresponding datasets, fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The f-Music dataset includes 3,976 samples across 22 scene types related to specific application scenarios, focusing on music scenes with complex instrument mixing. The f-Lifescene dataset contains 6,156 samples across 61 types representing diverse sounding objects in life scenarios. Moreover, we propose SSUFormer, a Sound-Source Understanding TransFormer benchmark that supports both sound source segmentation and sound region description with a multi-modal input and multi-modal output architecture. Specifically, we design two modules for this framework, the Mask Collaboration Module (MCM) and the Mixture of Hierarchical-prompted Experts (MoHE), which respectively improve the accuracy and enrich the detail of the sound source description. Extensive experiments on our two datasets verify the feasibility of the task, evaluate the usability of the datasets, and demonstrate the superiority of SSUFormer, which achieves state-of-the-art (SOTA) performance on the Sound Source Understanding benchmark.

Paper Structure (29 sections, 6 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Comparison of the Region-Aware Sound-Source Understanding task with several previous Audio-Visual Learning tasks. (a) Audio-Visual Correspondence: holistic correspondence with audio-level and video-level alignment. (b) Sound Source Localization: coarse-grained (spatial), region-level sound source localization. (c) Audio-Visual Event Localization: coarse-grained (temporal) event localization with frame boundaries. (d) Region-Aware Sound-Source Understanding: fine-grained audio-visual learning with regional sound source localization and sound region description.
  • Figure 2: Data samples from the created f-Music and f-Lifescene datasets. The top three samples are from f-Music and the lower ones are from f-Lifescene. Each sample contains a 10 s video with its frame-level sound source masks and descriptions; three frames from the 10 seconds are shown per sample. For some samples, the frame-by-frame description changes as the sounding object changes, as shown in the upper-right sample. (Zoom in for better details.)
  • Figure 3: The framework of the RA-SSU task. Multi-modal inputs are first processed by modality-specific encoders. The main networks are then designed to align, interact, and integrate audio-visual features. Task decoders produce the mask and description predictions in combination with the initial representations.
  • Figure 4: Data annotation process and labeling system for the proposed sound-source understanding task. The video data is first uploaded into the system. The SAM model is then used to obtain the initial masks. Based on the initial masks, TAM is used to obtain the frame-level video masks. Finally, the masked region of each frame-level image is fed into Chat-UniVi to obtain the region-aware descriptions. (Zoom in for better details.)
  • Figure 5: SSUFormer: fine-grained Sound-Source Understanding benchmark. On the left of the architecture, the audio and video are fed into the encoders and mapped to token representations. The multi-modality features are then fused with the attention mechanism shown in Fig. 6. Next, the fused features are passed to the task decoders for mask and description generation. (Zoom in for better details.)
  • ...and 5 more figures
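
The sample layout in Figure 2 and the three annotation stages in Figure 4 can be illustrated with a minimal Python sketch. Everything named here (`AnnotatedSample`, `annotate_clip`, and the `toy_*` functions) is hypothetical and not taken from the paper's code. The three callables only stand in for the roles the captions assign to SAM, TAM, and Chat-UniVi, and no real API of those models is used. The sketch also assumes that the initial mask comes from the first frame, which the caption does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical toy types: a frame is a 2-D grid of pixel values,
# a mask is a 2-D grid where 1 marks the sounding region.
Frame = List[List[int]]
Mask = List[List[int]]


@dataclass
class AnnotatedSample:
    """One clip with one mask and one description per frame."""
    frames: List[Frame]
    masks: List[Mask]
    descriptions: List[str]


def annotate_clip(
    frames: Sequence[Frame],
    init_mask_fn: Callable[[Frame], Mask],
    track_fn: Callable[[Sequence[Frame], Mask], List[Mask]],
    describe_fn: Callable[[Frame, Mask], str],
) -> AnnotatedSample:
    """Run the three stages in order: initial mask, tracking, description."""
    if not frames:
        raise ValueError("clip has no frames")
    # Stage 1 (SAM's role): initial mask. Using the first frame is an assumption.
    first_mask = init_mask_fn(frames[0])
    # Stage 2 (TAM's role): propagate the initial mask to every frame.
    masks = track_fn(frames, first_mask)
    if len(masks) != len(frames):
        raise ValueError("tracker must return one mask per frame")
    # Stage 3 (Chat-UniVi's role): describe the masked region of each frame.
    descriptions = [describe_fn(f, m) for f, m in zip(frames, masks)]
    return AnnotatedSample(list(frames), masks, descriptions)


# Toy stand-ins so the sketch runs without any model.
def toy_init(frame: Frame) -> Mask:
    # Mark every non-zero pixel as part of the sounding region.
    return [[1 if v > 0 else 0 for v in row] for row in frame]


def toy_track(frames: Sequence[Frame], first_mask: Mask) -> List[Mask]:
    # Keep the given first mask, then re-threshold each later frame.
    return [first_mask] + [toy_init(f) for f in frames[1:]]


def toy_describe(frame: Frame, mask: Mask) -> str:
    # Report the size of the masked region instead of a real caption.
    return f"{sum(map(sum, mask))} sounding pixels"
```

Calling `annotate_clip(frames, toy_init, toy_track, toy_describe)` on a list of toy frames returns an `AnnotatedSample` whose `masks` and `descriptions` lists have one entry per frame, mirroring the frame-level annotations described in the captions.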