CPM: Class-conditional Prompting Machine for Audio-visual Segmentation

Yuanhong Chen, Chong Wang, Yuyuan Liu, Hu Wang, Gustavo Carneiro

TL;DR

The paper tackles audio-visual segmentation (AVS) by addressing two weaknesses of transformer-based architectures: ineffective cross-modal interaction and unstable training. It introduces CPM, which combines class-agnostic queries with class-conditional prompts. Class-specific embeddings sampled from a Gaussian Mixture Model guide both audio and visual prompting (CCDM, ACP, VCP), and a prompting-based contrastive learning objective (PCL) is added on top. In experiments on the AVSBench and VPO datasets, CPM achieves state-of-the-art segmentation performance, and ablations confirm both the contribution of each component and the improved stability of bipartite matching. The work advances AVS by enabling more explicit audio-visual alignment and more robust instance-level segmentation, and may carry over to other multimodal perception tasks.

Abstract

Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by leveraging transformer-based segmentation architecture due to its inherent ability to capture long-range dependencies and flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy combining class-agnostic queries with class-conditional queries. The efficacy of cross-modal attention is upgraded with new learning objectives for the audio, visual and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy.
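The "unstable bipartite matching" mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' code: it assumes a DETR-style one-to-one assignment between queries and ground-truth targets, replaces the Hungarian algorithm with a brute-force search that is only practical for tiny inputs, and uses made-up cost values. The function name `match_queries` and both cost matrices are hypothetical.

```python
from itertools import permutations


def match_queries(cost):
    """Return (assignment, total_cost) for the cheapest one-to-one matching.

    cost[q][t] is a hypothetical matching cost between query q and target t.
    assignment[t] is the index of the query matched to target t.
    Brute force over permutations, so only suitable for toy sizes.
    """
    num_queries, num_targets = len(cost), len(cost[0])
    best_total, best_assign = float("inf"), None
    for queries in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(queries))
        if total < best_total:
            best_total, best_assign = total, queries
    return list(best_assign), best_total


# Illustrative costs for 3 queries and 2 targets (not taken from the paper).
cost_step_a = [[0.20, 0.90],
               [0.25, 0.80],
               [0.70, 0.30]]
# Same costs, except query 0 is now slightly worse for target 0.
cost_step_b = [[0.26, 0.90],
               [0.25, 0.80],
               [0.70, 0.30]]

assign_a, _ = match_queries(cost_step_a)
assign_b, _ = match_queries(cost_step_b)
print(assign_a, assign_b)  # target 0 switches from query 0 to query 1
```

A small change in one cost entry is enough to reassign a target to a different query, which is the kind of instability that class-conditional queries are meant to reduce.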

Paper Structure (33 sections, 8 equations, 8 figures, 6 tables, 1 algorithm)


Figures (8)

  • Figure 1: Comparison of conventional AVS methods [chen2023closer, gao2023avsegformer] with our CPM approach. CPM inherits the class-agnostic query from transformer-based methods and integrates class-conditional prompts sampled from the learned joint-modal data distribution to achieve three objectives: 1) learn disentangled audio partitioning, 2) facilitate semantic-guided object identification, and 3) promote more explicit audio-visual contrastive learning.
  • Figure 2: Illustration of our CPM method. Starting from a scene with a mixture of sound sources (Male, Female and Guitar), training alternates between learnable class-agnostic queries and queries sampled from a list of class-specific query features from the GMM, denoted $\mathbf{z}^{\text{male}}$, $\mathbf{z}^{\text{female}}$ and $\mathbf{z}^{\text{guitar}}$. The overall training objective comprises three learning tasks: 1) audio conditional prompting (ACP), which uses $\mathbf{z}^{\text{male}}$, $\mathbf{z}^{\text{female}}$ and $\mathbf{z}^{\text{guitar}}$ to recover the original magnitude spectrogram $\mathbf{a}_i$ from the noisy spectrogram (i.e., $\mathbf{a}_i + \mathbf{a}_j$) that is corrupted by another audio signal $\mathbf{a}_j$ (here, Dog); 2) visual conditional prompting (VCP), which aims to probe the pixels corresponding to the class-specific query features; and 3) a contrastive learning task that densely constrains the audio and visual representations. During training, both the CPM Workflow (orange arrow) and the Class-agnostic Workflow (black arrow) are used; only the Class-agnostic Workflow is used for inference.
  • Figure 3: Qualitative and quantitative comparisons between model components in a multi-source scenario (i.e., male singing, female singing and guitar).
  • Figure 4: Comparison of matching stability ($\mathsf{STS}\downarrow$, lower is better) on AVSBench-Semantics [liu2023audio].
  • Figure 5: Qualitative audio-visual segmentation results on AVSBench-Semantics [zhou2023audio] by TPAVI [zhou2022audio], AVSegFormer [gao2023avsegformer], CAVP [chen2023closer] and our CPM, for comparison with the ground truth (GT) Ambulance in the first row.
  • ...and 3 more figures
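
The data flow described in the Figure 2 caption (sampling class-specific prompts from a GMM and forming the noisy spectrogram $\mathbf{a}_i + \mathbf{a}_j$) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the GMM parameters below are invented rather than learned, the diagonal-covariance layout, the 4-dimensional embeddings, and the names `sample_class_prompt`, `mix_spectrograms` and `class_gmms` are all hypothetical, and the recovery network that would consume the prompts is omitted.

```python
import random


def sample_class_prompt(gmm, rng):
    """Draw one embedding from a diagonal Gaussian mixture for a single class.

    gmm is a list of (weight, mean_vector, std_vector) components;
    this layout is an assumption made for the sketch.
    """
    weights = [w for w, _, _ in gmm]
    _, mean, std = rng.choices(gmm, weights=weights, k=1)[0]
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def mix_spectrograms(a_i, a_j):
    """Element-wise sum of two magnitude spectrograms, i.e. a_i + a_j."""
    return [[x + y for x, y in zip(row_i, row_j)]
            for row_i, row_j in zip(a_i, a_j)]


rng = random.Random(0)

# Made-up two-component mixtures over 4-dimensional embeddings, one per class.
class_gmms = {
    "male":   [(0.7, [0.0] * 4, [0.1] * 4), (0.3, [1.0] * 4, [0.1] * 4)],
    "female": [(0.7, [2.0] * 4, [0.1] * 4), (0.3, [3.0] * 4, [0.1] * 4)],
    "guitar": [(0.7, [4.0] * 4, [0.1] * 4), (0.3, [5.0] * 4, [0.1] * 4)],
}
# One sampled prompt z^class per sounding class in the scene.
prompts = {name: sample_class_prompt(gmm, rng) for name, gmm in class_gmms.items()}

# Toy 2x3 magnitude spectrograms: a_i is the scene audio, a_j the distractor.
a_i = [[1.0, 0.0, 2.0], [0.0, 2.0, 1.0]]
a_j = [[0.5, 0.5, 0.0], [0.0, 1.0, 0.25]]
noisy = mix_spectrograms(a_i, a_j)
print(noisy)
```

In an ACP-style objective, a model would receive `noisy` together with `prompts` and be trained to reconstruct `a_i`; that model is outside the scope of this sketch.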