Audio Visual Segmentation Through Text Embeddings

Kyungbok Lee, You Zhang, Zhiyao Duan

TL;DR

This work tackles the data scarcity challenge in audio-visual segmentation (AVS) by introducing AV2T-SAM, which maps audio prompts into the text embedding space of a pre-trained text-prompted SAM, enabling cross-modal semantic alignment from text-image data. A semantically aligned feature, $f_{CLIP} \odot f_{CLAP}$, is proposed to fuse audio and visual semantics, while adapters fuse audio cues into a frozen SAM encoder for efficient training. The method achieves state-of-the-art performance on the AVSBench datasets (S4 and MS3) and reveals a vision bias in S4, where high segmentation accuracy can be achieved with visual cues alone. The approach highlights the value of cross-modal embeddings and text-driven segmentation foundations for robust AVS with limited labeled data.

Abstract

The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects in video frames. Research on AVS suffers from data scarcity due to the high cost of fine-grained manual annotations. Recent works attempt to overcome the challenge of limited data by leveraging the vision foundation model Segment Anything Model (SAM), prompting it with audio to enhance its ability to segment sounding source objects. While this approach reduces the model's burden of understanding the visual modality by reusing the knowledge of a pre-trained SAM, it does not address the fundamental challenge of learning audio-visual correspondence with limited data. To address this limitation, we propose AV2T-SAM, a novel framework that bridges audio features with the text embedding space of a pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from rich text-image paired datasets to enhance audio-visual alignment. Furthermore, we introduce a novel feature, $f_{CLIP} \odot f_{CLAP}$, which emphasizes the shared semantics of the audio and visual modalities while filtering irrelevant noise. Our approach outperforms existing methods on the AVSBench dataset by effectively utilizing pre-trained segmentation models and cross-modal semantic alignment. The source code is released at https://github.com/bok-bok/AV2T-SAM.
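
To make the element-wise product in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the released implementation: the function names `l2_normalize` and `hadamard_fuse` are hypothetical, the 4-dimensional vectors are made-up numbers rather than real CLIP or CLAP outputs, and both embeddings are assumed to already share one dimension (this section does not say how the paper brings them to a common size). The L2 normalization is likewise an assumption added for the illustration.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def hadamard_fuse(f_clip: Sequence[float], f_clap: Sequence[float]) -> List[float]:
    """Element-wise (Hadamard) product of an image and an audio embedding.

    Assumption: both vectors already live in a space of equal dimension.
    """
    if len(f_clip) != len(f_clap):
        raise ValueError("embeddings must have the same dimension")
    a = l2_normalize(f_clip)
    b = l2_normalize(f_clap)
    return [x * y for x, y in zip(a, b)]


# Toy embeddings (hypothetical values, chosen only to show the effect).
f_clip = [0.9, 0.1, 0.0, 0.4]  # image embedding: active on dims 0, 1 and 3
f_clap = [0.8, 0.0, 0.6, 0.0]  # audio embedding: active on dims 0 and 2

fused = hadamard_fuse(f_clip, f_clap)
# Only dimension 0 is active in both inputs, so it is the only nonzero entry.
```

In this toy case, dimensions that are active in only one modality are zeroed out, which is one way to read "emphasizes shared semantics while filtering irrelevant noise". With unit-length inputs, the sum of the fused vector equals the cosine similarity of the two embeddings, so the product can be viewed as a per-dimension breakdown of that similarity.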

Paper Structure

This paper contains 20 sections, 4 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of the proposed AV2T-SAM framework. Algorithm 1 specifies the process of the AV2T-SAM framework.
  • Figure 2: Comparison with other methods on examples. Our method successfully separates and segments a whole object compared to other methods.