Complementary and Contrastive Learning for Audio-Visual Segmentation

Sitong Gong, Yunzhi Zhuge, Lu Zhang, Pingping Zhang, Huchuan Lu

TL;DR

This work presents the Complementary and Contrastive Transformer (CCFormer), a framework that processes both local and global information and captures spatial-temporal context comprehensively. It also proposes Bi-modal Contrastive Learning (BCL) to align the audio and visual modalities in a unified feature space.

Abstract

Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress, with numerous CNN- and Transformer-based methods enhancing segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplication but are restricted by the limited local receptive field of CNNs. More recently, Transformer-based methods treat auditory cues as queries, using attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. CCFormer begins with the Early Integration Module (EIM), which employs a parallel bilateral architecture that merges multi-scale visual features with audio data to boost cross-modal complementarity. To extract intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models frame-level and video-level relations simultaneously. Furthermore, we propose Bi-modal Contrastive Learning (BCL) to promote alignment across both modalities in a unified feature space. Through the effective combination of these designs, our method sets new state-of-the-art results on the S4, MS3 and AVSS datasets. Our source code and model weights will be made publicly available at https://github.com/SitongGong/CCFormer
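As a rough illustration of the bidirectional audio-visual interaction mentioned in the abstract, the sketch below applies plain cross-attention in both directions between visual tokens and audio tokens. It is not the authors' Early Integration Module: the function names (`cross_attention`, `bidirectional_fusion`), the single attention head, the absence of learned projections, and the residual addition are all assumptions made to keep the example small.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention (assumption: no learned projections).

    queries: (num_queries, dim), context: (num_context, dim).
    Returns the attended features (num_queries, dim) and the attention weights.
    """
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T / scale, axis=-1)
    return weights @ context, weights


def bidirectional_fusion(visual_tokens, audio_tokens):
    """Hypothetical two-way fusion: each modality attends to the other."""
    # Visual tokens query the audio tokens (audio-guided vision).
    audio_for_visual, _ = cross_attention(visual_tokens, audio_tokens)
    # Audio tokens query the visual tokens (vision-guided audio).
    visual_for_audio, _ = cross_attention(audio_tokens, visual_tokens)
    # Residual addition is an assumption, not taken from the paper.
    return visual_tokens + audio_for_visual, audio_tokens + visual_for_audio


rng = np.random.default_rng(0)
visual = rng.normal(size=(12, 8))  # e.g. 12 flattened spatial positions, 8 channels
audio = rng.normal(size=(3, 8))    # e.g. 3 audio tokens, 8 channels
fused_visual, fused_audio = bidirectional_fusion(visual, audio)
```

Both outputs keep the shape of their input modality, so a block of this kind could in principle be repeated at each visual feature scale.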

Paper Structure

This paper contains 34 sections, 13 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Comparative frameworks in audio-visual segmentation (AVS): (a) AVS aims to produce mask sequences according to audio prompts. Previous approaches mainly rely on (b) FCN or (c) Transformer blocks to perform per-frame audio-visual integration, insufficiently addressing the intrinsic temporal coherence in both video and audio sources. In contrast, (d) our CCFormer realizes a comprehensive spatial-temporal interaction between the two modalities by building bidirectional integration, intra- and inter-frame interaction, and query combination.
  • Figure 2: Overview of the CCFormer architecture. The model begins by employing the Early Integration Module to perform cross-frame bidirectional fusion of the audio and multi-scale visual features. Subsequently, the Multi-query Transformer Module introduces two types of queries and a progressive interaction strategy. Specifically, we utilize the Attention Query Generator to yield the intra-frame queries for cross-modal interactions with visual features within a single frame. Then the inter-frame queries are initialized to perform temporal interactions with the reshaped intra-frame queries. After multiple iterations, both types of queries are combined to generate the mask embedding. Finally, the Bi-modal Contrastive Loss is computed between the audio features before and after early integration, further aligning the multimodal features.
  • Figure 3: Illustration of our Early Integration Module. It takes the integration process of $F^{i}_{v}$ and $F^{i-1}_{a}$ as an example. This module employs a bidirectional structure for initial feature fusion, where Audio-guided Vision and Vision-guided Audio Enhancement Modules leverage cross-attention mechanisms for bi-modal feature interaction and dimension alignment.
  • Figure 4: Bi-modal Contrastive Learning configuration. The contrastive loss is employed to compare the original audio features with the audio features fused with visual information. We consider the features of corresponding audio frames as positive (the blue blocks) and regard the audio frames from different audio samples within the same batch as negative (the solid white border).
  • Figure 5: Qualitative comparisons of TPAVI [zhou2022audio], AVSegFormer [gao2024avsegformer] and our CCFormer on the AVSBench-object dataset. We provide scenarios with scene transitions.
  • ...and 4 more figures
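
The positive and negative pairing described in the Figure 4 caption can be sketched as an InfoNCE-style loss between the original audio features and the audio features fused with visual information. This is a minimal sketch under stated assumptions, not the paper's loss: the function name `bimodal_contrastive_loss`, the cosine similarity, the temperature value, and the choice to exclude other frames of the same audio sample from the denominator (the caption names only frames from different samples as negatives) are all illustrative.

```python
import numpy as np


def bimodal_contrastive_loss(audio_orig, audio_fused, temperature=0.07):
    """InfoNCE-style loss (assumed form) between two views of the audio features.

    audio_orig, audio_fused: arrays of shape (batch, frames, dim) with nonzero rows.
    Positive: the same frame of the same sample in the other view.
    Negatives: frames that belong to different samples in the batch.
    """
    batch, frames, dim = audio_orig.shape
    a = audio_orig.reshape(batch * frames, dim)
    f = audio_fused.reshape(batch * frames, dim)
    # L2-normalise so that the dot product is a cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    logits = f @ a.T / temperature  # (batch*frames, batch*frames)

    # Assumption: other frames of the same sample are neither positive
    # nor negative, so they are masked out of the softmax denominator.
    sample_id = np.repeat(np.arange(batch), frames)
    same_sample = sample_id[:, None] == sample_id[None, :]
    positive = np.eye(batch * frames, dtype=bool)
    allowed = positive | ~same_sample
    logits = np.where(allowed, logits, -np.inf)

    # Stable log-softmax evaluated at the positive (diagonal) entry.
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_prob_positive = np.diag(logits) - log_denominator
    return float(-log_prob_positive.mean())


rng = np.random.default_rng(0)
original = rng.normal(size=(4, 5, 16))       # 4 audio samples, 5 frames, 16 channels
aligned = original.copy()                    # fused view that matches its own frame
mismatched = np.roll(original, 1, axis=0)    # fused view that matches another sample
loss_aligned = bimodal_contrastive_loss(original, aligned)
loss_mismatched = bimodal_contrastive_loss(original, mismatched)
```

In this toy setup the loss is small when each fused feature matches its own original frame and larger when it matches a frame from a different sample, which is the behaviour a contrastive alignment objective is meant to reward.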