Open-Vocabulary Audio-Visual Semantic Segmentation

Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying

TL;DR

This work introduces open-vocabulary audio-visual semantic segmentation (open-vocabulary AVSS), which segments and labels sound-emitting objects from open-set categories in videos. It proposes OV-AVSS, a two-module framework: a universal sound source localization module (USSLM) for class-agnostic segmentation, and an open-vocabulary classification module (OVCM) that leverages CLIP for category prediction. USSLM combines audio-visual early fusion with an audio-conditioned Transformer decoder to capture temporal dynamics. On AVSBench-OV, the method achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6% (base/novel). These results suggest that combining audio-visual cues with vision-language priors is a practical way to handle open-set sound sources in realistic video understanding tasks.

Abstract

Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the closed-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have been neither seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of the prior knowledge from large-scale pre-trained vision-language models. To properly evaluate open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.
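
The abstract reports mIoU separately for base and novel categories. As a rough illustration of how such a split metric can be computed, the sketch below averages per-class IoU over two class subsets. The function names, the toy label maps, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
def per_class_iou(pred, target, class_ids):
    """Return {class_id: IoU} for flattened label maps of equal length.

    Classes that appear in neither map are skipped (an assumption of this sketch).
    """
    ious = {}
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious[c] = inter / union
    return ious


def mean_iou(ious, subset):
    """Average IoU over the classes in `subset` that have a defined IoU."""
    vals = [ious[c] for c in subset if c in ious]
    return sum(vals) / len(vals) if vals else float("nan")


# Toy example: label 0 is background, 1 is a base class, 2 is a novel class.
pred = [0, 1, 1, 2, 2, 2, 0, 0]
target = [0, 1, 2, 2, 2, 1, 0, 0]

ious = per_class_iou(pred, target, class_ids=[1, 2])
base_miou = mean_iou(ious, subset=[1])   # hypothetical base split
novel_miou = mean_iou(ious, subset=[2])  # hypothetical novel split
```

Reporting the two averages separately makes it visible when a model does well on training categories but poorly on unseen ones, which is the gap the base/novel numbers above are meant to expose.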

Paper Structure (17 sections, 9 equations, 4 figures, 5 tables)

This paper contains 17 sections, 9 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An illustration of open-vocabulary audio-visual semantic segmentation. (a) Traditional AVSS models trained on closed-set classes (woman, piano, and car) fail to segment a novel class (pipa). (b) Our open-vocabulary model correctly localizes sounding objects and recognizes arbitrary categories, e.g., pipa, without using any annotations.
  • Figure 2: Overview of the proposed OV-AVSS. (a) Universal Sound Source Localization: The audio-visual early fusion module takes the image and audio features as input and aligns them in the spatial domain. The fused features are then passed into the pixel decoder and the audio-conditioned Transformer decoder, which captures audio-visual dependencies in the temporal domain and generates a class-agnostic mask for each sounding object. (b) Open-Vocabulary Classification: After localizing sounding objects and obtaining their masks, we crop the input frames with the masks and feed them into the CLIP image encoder to generate image embeddings. The dot product of these image embeddings with the text embeddings generated by the CLIP text encoder gives the object categories.
  • Figure 3: The architecture of our proposed audio-conditioned Transformer decoder. $N$ class-independent object queries learn semantics from image features and capture audio-visual temporal dependencies from audio embeddings.
  • Figure 4: Qualitative results of our novel OV-AVSS framework on diverse audio-visual scenarios. Predicted categories are shown in the title of each example. Green text represents base categories, while red text denotes novel categories. (a) and (c) present single-source scenarios with one novel category. (b) shows a multi-source scenario containing both base and novel categories.
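
The open-vocabulary classification step described in the Figure 2 caption (image embeddings of masked regions compared with text embeddings by dot product) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are hand-written toy vectors standing in for CLIP outputs, and the function name, the L2 normalization, and the `temperature` value are assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_masked_regions(image_embeddings, text_embeddings, class_names,
                            temperature=0.01):
    """Label each masked-region embedding with the best-matching class name.

    `temperature` is a hypothetical softmax scaling factor, not a value
    taken from the paper.
    """
    text_unit = [l2_normalize(t) for t in text_embeddings]
    results = []
    for emb in image_embeddings:
        img_unit = l2_normalize(emb)
        # Dot product between the region embedding and every class text embedding.
        logits = [sum(a * b for a, b in zip(img_unit, t)) / temperature
                  for t in text_unit]
        # Numerically stable softmax over the candidate classes.
        max_logit = max(logits)
        exps = [math.exp(l - max_logit) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((class_names[best], probs[best]))
    return results


# Toy stand-ins for CLIP text embeddings of three class prompts.
class_names = ["piano", "pipa", "car"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Toy stand-ins for CLIP image embeddings of two masked sounding objects.
image_embeddings = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.1]]

predictions = classify_masked_regions(image_embeddings, text_embeddings, class_names)
```

Because the class list is only a set of text embeddings, a category unseen during training (such as pipa in Figure 1) can be added at inference time by encoding one more prompt, with no change to the segmentation module.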