TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models

Ziyang Luo, Nian Liu, Xuguang Yang, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Junwei Han

TL;DR

TAViS tackles audio-visual segmentation by tightly coupling ImageBind's cross-modal alignment with SAM2's segmentation capabilities. It introduces a text-bridged framework comprising the IBQD module for object-level audio decomposition, sparse and dense text-bridged prompts, and two text-bridged alignment losses (audio-to-text and image-to-text) that supervise cross-modal alignment. The approach yields state-of-the-art or competitive results on single-source, multi-source, and semantic AVS benchmarks, and it generalizes well in zero-shot settings thanks to the text bridge. By using text as an intermediate representation to draw on foundation-model knowledge, it offers one solution for binary, semantic, and zero-shot AVS tasks.

Abstract

Audio-Visual Segmentation (AVS) faces a fundamental challenge of effectively aligning audio and visual modalities. While recent approaches leverage foundation models to address data scarcity, they often rely on single-modality knowledge or combine foundation models in an off-the-shelf manner, failing to address the cross-modal alignment challenge. In this paper, we present TAViS, a novel framework that couples the knowledge of a multimodal foundation model (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. However, effectively combining these models poses two key challenges: the difficulty of transferring knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only a segmentation loss for supervision. To address these challenges, we introduce a text-bridged design with two key components: (1) a text-bridged hybrid prompting mechanism where pseudo text provides class prototype information while retaining modality-specific details from both audio and visual inputs, and (2) an alignment supervision strategy that leverages text as a bridge to align shared semantic concepts within audio-visual modalities. Our approach achieves superior performance on single-source, multi-source, and semantic datasets, and excels in zero-shot settings.
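
The alignment supervision described above can be illustrated with a small sketch. This is not the paper's loss: it assumes a generic cross-entropy over temperature-scaled cosine similarities between query embeddings (audio or image) and class text embeddings. The function name `text_bridged_alignment_loss`, the temperature value, and the toy data are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def text_bridged_alignment_loss(queries, text_prototypes, labels, temperature=0.07):
    """Assumed form of an a2t/i2t-style loss (not taken from the paper).

    queries:         (N, D) audio or image embeddings
    text_prototypes: (C, D) one text embedding per class
    labels:          (N,)   class index of each query
    """
    q = l2_normalize(queries)
    t = l2_normalize(text_prototypes)
    logits = q @ t.T / temperature                       # (N, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy data: queries close to their class text embedding vs. queries
# that sit on the text embedding of a different class.
rng = np.random.default_rng(0)
text_prototypes = rng.normal(size=(3, 16))
labels = np.array([0, 2, 1, 2])
aligned = text_prototypes[labels] + 0.01 * rng.normal(size=(4, 16))
misaligned = text_prototypes[(labels + 1) % 3]

loss_aligned = text_bridged_alignment_loss(aligned, text_prototypes, labels)
loss_misaligned = text_bridged_alignment_loss(misaligned, text_prototypes, labels)
```

Under these assumptions the loss is low when each query lies near the text embedding of its own class and high otherwise, which is the sense in which text can act as a shared anchor for both modalities.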

Paper Structure

This paper contains 25 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of TAViS compared with previous methods. (1) Previous methods often rely on single-modality foundation models (top) or combine the visual foundation model with ImageBind in an off-the-shelf manner (bottom), which limits their ability to address the misalignment of audio-visual information with significant intra-class diversity. (2) We propose TAViS, a novel framework that leverages two foundation models: ImageBind and SAM2. The text-bridged alignment between visual and audio modalities serves as both prompts and supervision signals for TAViS. (3) We demonstrate the effectiveness of our approach through comparative experiments with and without the text-bridged mechanism.
  • Figure 2: Overall framework of our proposed model. Our framework integrates two foundation models: SAM2 for precise segmentation and ImageBind for audio-visual alignment. The input audio is first processed by the audio encoder to extract audio embeddings, which are then decomposed into object-level queries via the IBQD module. Meanwhile, the input image is processed by SAM2. The text-bridged hybrid prompts, comprising a sparse prompt ($\bm{p}$) and a dense prompt ($\bm{t}_v$), are then generated and fed into the mask decoder to obtain segmentation masks. Finally, two text-bridged alignment losses, $\mathcal{L}_{a2t}$ and $\mathcal{L}_{i2t}$, are applied to supervise the audio-to-text and image-to-text relationships, respectively.
  • Figure 3: Illustration of our text-bridged hybrid prompting mechanism. For the sparse prompt, we combine pseudo text embeddings with audio-specific queries generated through an MLP. For the dense prompt, we process the image input through ImageBind to obtain the cls token, which is repeated and applied across all pixel locations in SAM2's image feature.
  • Figure 4: t-SNE visualization with the text bridge (top) and without the text bridge (bottom). Different colors indicate different classes.
  • Figure 5: Qualitative comparison of our model against state-of-the-art AVSS methods. From top to bottom: (1) audio visibility graphs; (2) visual frames; (3) prediction maps from AVSegFormer [gao2024avsegformer]; (4) prediction maps from SAMA-AVS [liu2024annotation] (left) and AVSBench [zhou2023audio] (right); (5) prediction maps from our model; and (6) ground truth.
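
The hybrid prompting described in the Figure 2 and Figure 3 captions can be sketched with plain arrays. This is a minimal illustration under stated assumptions, not the authors' implementation: the captions say only that the sparse prompt "combines" pseudo text embeddings with MLP-generated audio queries and that the cls token is "repeated and applied" across pixel locations, so the sketch assumes element-wise addition in both cases and a hypothetical linear projection `proj` to match channel sizes. All function names, shapes, and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU; the weights here are random stand-ins."""
    return np.maximum(x @ w1, 0.0) @ w2

def sparse_prompt(audio_queries, pseudo_text, w1, w2):
    """Sparse prompt p. Assumption: pseudo text embedding plus MLP(audio query)."""
    return pseudo_text + mlp(audio_queries, w1, w2)

def dense_prompt(cls_token, image_feature, proj):
    """Dense prompt t_v. Assumption: project the cls token to the feature
    channel size, repeat it over every pixel, and add it to the image feature."""
    c, h, w = image_feature.shape
    t_v = cls_token @ proj                                   # (C,)
    tiled = np.broadcast_to(t_v[:, None, None], (c, h, w))   # same vector per pixel
    return image_feature + tiled

# Hypothetical sizes: 4 object-level queries, 32-dim embeddings,
# and an 8-channel 6x6 image feature map.
n_queries, d_embed, c, h, w = 4, 32, 8, 6, 6
audio_queries = rng.normal(size=(n_queries, d_embed))
pseudo_text = rng.normal(size=(n_queries, c))
w1 = rng.normal(size=(d_embed, 16))
w2 = rng.normal(size=(16, c))
p = sparse_prompt(audio_queries, pseudo_text, w1, w2)        # (4, 8)

image_feature = rng.normal(size=(c, h, w))
cls_token = rng.normal(size=(d_embed,))
proj = rng.normal(size=(d_embed, c))
fused = dense_prompt(cls_token, image_feature, proj)         # (8, 6, 6)
```

In this sketch the sparse prompt keeps one vector per object-level query, while the dense prompt injects the same image-level vector at every spatial position, so it shifts the whole feature map without changing its shape.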