Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation

Chengzhi Li, Heyan Huang, Ping Jian, Yanghao Zhou

Abstract

Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.

Paper Structure (15 sections, 3 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of PCAS: The audio encoder processes audio frame-wise with visual prompts from the visual encoder. CMC aligns global semantics; CMPC and CMCC refine fine-grained alignment via cross-modal similarity. The decoder then outputs masks using CAM-based pseudo-labels.
  • Figure 2: CMPC loss computation. ViT, AST*, $P_a$, and $P_p$ denote the visual encoder, the audio encoder with visual prompts, and the projection layers for audio semantic tokens and visual patch tokens, respectively. Patches, AUX Patches, and Patch Labels are the ViT patch outputs, the intermediate auxiliary patches, and the positive/negative token-wise contrast labels, respectively.
  • Figure 3: t-SNE of three token types with and without CMC. Points are samples; shapes indicate token types; colors indicate categories. '$\blacktriangle$', '$\star$', and '•' denote visual classification tokens ($v_{cls}$), visual semantic tokens ($v_{sem}$), and audio semantic tokens ($a_{sem}$), respectively.
  • Figure 4: Qualitative comparison of outputs from PCAS and other weakly supervised baselines on the AVS-S4 test set.
  • Figure 5: PCAS output cases on the AVSS test set. No weakly supervised baselines exist for comparison, so none are shown.