Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, Xi Li

TL;DR

This work tackles generalization in Audio-Visual Segmentation under zero-shot and few-shot conditions by replacing the traditional encoder-fusion-decoder pipeline with an encoder-prompt-decoder framework that leverages a visual foundation model. It introduces Semantic-aware Audio Prompt (SAP) to align audio and visual semantics and a Correlation Adapter (ColA) to preserve pre-trained visual priors while constructing audio-visual correlations through the Audio Source Decoder. The approach, evaluated on AVS-Benchmarks, AVS-V3, and VGG-SS, demonstrates superior cross-domain and unseen-class performance and strong data efficiency, outperforming fusion-based baselines in zero-shot and few-shot regimes. These results highlight the potential of prompt-based multimodal reasoning with large pre-trained models to improve practical generalization in AVS tasks, especially where labeled data are scarce or distribution shifts are common.

Abstract

Can a model that has never seen an object and heard its sound at the same time still accurately localize the object's visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks under the demanding zero-shot and few-shot scenarios. Unlike existing approaches, which mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, which aims to better cope with data scarcity and varying data distributions by drawing on the abundant knowledge of pre-trained models. Specifically, we first construct a Semantic-aware Audio Prompt (SAP) that helps the visual foundation model focus on sounding objects while also encouraging the semantic gap between the visual and audio modalities to shrink. Then, we develop a Correlation Adapter (ColA) that keeps training effort minimal while retaining adequate knowledge of the visual foundation model. Equipped with these components, the new paradigm outperforms fusion-based methods in both the unseen-class and cross-dataset settings across extensive experiments. We hope that our work can further promote the study of generalization for Audio-Visual Localization and Segmentation in practical application scenarios.

Paper Structure (29 sections, 11 equations, 7 figures, 8 tables)

This paper contains 29 sections, 11 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The AVS pipelines of the encoder-fusion-decoder paradigm (upper center) and our proposed encoder-prompt-decoder paradigm (lower center). Classical encoder-fusion-decoder methods decode the mask from the fused modality, while we prompt the visual input with audio to adapt the AVL and AVS tasks to the visual foundation model. The results on the VGG-SS dataset highlight the challenge of generalizing across different datasets. However, our approach breaks through the 40% cIoU barrier, bringing performance closer to that of the best method trained in-set (on VGG-Sound).
  • Figure 1: Visualization of segmentation results on the supervised AVS-V2 dataset.
  • Figure 2: Overview of GAVS. (1) We first align the audio and visual semantics for SAP and introduce visual features as cues (the green one in $F_{A^\prime}$) for the audio input (the blue one in $F_{A^\prime}$). We then combine the audio input with learnable adaptive noise (the pink one in $F_{A^\prime}$) to construct the final SAP $F_{A^\prime}$ and obtain the projected prompt $F_P$. (2) Next, we use cross-modal attention in the Audio Source Decoder to learn the correlation between the audio and visual modalities, projecting audio into the visual space. The self-attention over $F_P$ before the first cross-modal attention is omitted for clarity.
  • Figure 2: Visualization of segmentation results on the supervised AVS-V2 test set.
  • Figure 3: Performance of AVS models on the AVS-V2 dataset as a function of the amount of training data. We compare models trained on subsets consisting of 10%, 30%, and 50% of the full dataset. Our method achieves better performance with only 10% of the training data than other models trained with 30%. Moreover, when trained on only half of the data, our model outperforms other models trained on the full dataset.
  • ...and 2 more figures
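
The Figure 2 caption mentions cross-modal attention between the projected prompt $F_P$ and the visual features. As a rough illustration of that general mechanism only, the following is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is not the authors' Audio Source Decoder: the function names, the toy vectors, the choice of prompt tokens as queries with visual patches as keys and values, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of query vectors (here: hypothetical audio prompt tokens)
    keys, values: lists of vectors (here: hypothetical visual patch features)
    Returns the attended outputs and the attention weights per query.
    Learned projection matrices are omitted to keep the sketch short.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical toy data: one audio prompt token and three visual patches.
audio_prompt = [[1.0, 0.0]]
visual_patches = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]

attended, attn = cross_attention(audio_prompt, visual_patches, visual_patches)
print("attention over patches:", [round(w, 3) for w in attn[0]])
```

In this toy setup the prompt token places most of its attention on the first patch, the one whose feature points in the same direction as the prompt. That is the sense in which an audio-derived prompt can select image regions, under the assumptions stated above.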