CPM: Class-conditional Prompting Machine for Audio-visual Segmentation
Yuanhong Chen, Chong Wang, Yuyuan Liu, Hu Wang, Gustavo Carneiro
TL;DR
The paper tackles audio-visual segmentation (AVS) by addressing two problems in transformer-based architectures: weak cross-modal interaction and training instability. It introduces CPM, which combines class-agnostic prompts with class-conditional prompts whose class-specific embeddings are sampled from a Gaussian Mixture Model. These embeddings guide both audio and visual prompting (CCDM, ACP, VCP), and CPM adds a prompting-based contrastive learning objective (PCL). In experiments on the AVSBench and VPO datasets, CPM achieves state-of-the-art segmentation performance, with ablations confirming the effectiveness of each component and the stability improvements in bipartite matching. The work advances AVS by enabling more explicit audio-visual alignment and robust instance-level segmentation, with potential for broader impact in multimodal perception tasks.
Abstract
Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by leveraging a transformer-based segmentation architecture, owing to its inherent ability to capture long-range dependencies and its flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy that combines class-agnostic queries with class-conditional queries. The efficacy of cross-modal attention is improved with new learning objectives for the audio, visual and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy.
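
To make the idea of combining class-agnostic queries with class-conditional queries more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each class has a small diagonal-covariance Gaussian mixture with fixed, hand-written parameters (in a real system these would be learned), and it uses toy 4-dimensional embeddings. The names `CLASS_GMMS`, `sample_class_prompt` and `build_queries` are hypothetical and chosen only for illustration.

```python
import random

def sample_class_prompt(gmm, rng):
    """Draw one embedding from a diagonal-covariance Gaussian mixture.

    gmm: list of (weight, means, stds) tuples, one per mixture component.
    """
    weights = [w for w, _, _ in gmm]
    # Pick a component in proportion to its mixture weight.
    _, means, stds = rng.choices(gmm, weights=weights, k=1)[0]
    # Sample each dimension independently (diagonal covariance assumption).
    return [rng.gauss(m, s) for m, s in zip(means, stds)]

# Hypothetical per-class mixtures over a toy 4-dimensional embedding space.
CLASS_GMMS = {
    "dog": [
        (0.7, [1.0, 0.0, 0.0, 0.0], [0.1] * 4),
        (0.3, [0.8, 0.2, 0.0, 0.0], [0.1] * 4),
    ],
    "guitar": [
        (1.0, [0.0, 1.0, 0.0, 0.0], [0.1] * 4),
    ],
}

def build_queries(class_agnostic, target_classes, rng):
    """Append one sampled class-conditional prompt per target class
    to the list of class-agnostic queries."""
    conditional = [sample_class_prompt(CLASS_GMMS[c], rng) for c in target_classes]
    return class_agnostic + conditional

rng = random.Random(0)  # seeded for reproducibility
agnostic = [[0.0] * 4 for _ in range(2)]  # stand-ins for learned queries
queries = build_queries(agnostic, ["dog", "guitar"], rng)
```

Here `queries` holds the two class-agnostic entries followed by one sampled prompt per target class. Because each class-conditional prompt is drawn for a known class, it carries an explicit semantic cue, which is the property the abstract argues a learned audio query can lack.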
