Open-Vocabulary Audio-Visual Semantic Segmentation
Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
TL;DR
This work introduces open-vocabulary audio-visual semantic segmentation (open-vocabulary AVSS), a task that segments and labels sound-emitting objects from open-set categories in videos. It proposes OV-AVSS, a two-module framework: a universal sound source localization module (USSLM) for class-agnostic segmentation and an open-vocabulary classification module (OVCM) that leverages CLIP for category prediction, with audio-visual early fusion and an audio-conditioned transformer decoder to capture temporal dynamics. On AVSBench-OV, the method achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6%. These results suggest that combining audio-visual cues with vision-language priors is a practical way to handle open-set sound sources in realistic video understanding tasks.
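As an illustration of how mIoU can be reported separately for base and novel categories, the following is a minimal sketch for a single pair of label maps. The function names, the toy label maps, and the base/novel split are assumptions made for this example, not the authors' evaluation code; benchmark-level mIoU is typically accumulated over the whole test set rather than computed per frame.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Return IoU per class id; NaN where the class is absent from both maps."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p = pred == c
        g = gt == c
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious[c] = np.logical_and(p, g).sum() / union
    return ious

def group_miou(ious, class_ids):
    """Average IoU over a group of class ids, ignoring absent (NaN) classes."""
    return float(np.nanmean(ious[list(class_ids)]))

# Toy 2x2 label maps (hypothetical data).
pred = np.array([[0, 1], [1, 2]])
gt = np.array([[0, 1], [2, 2]])

# Hypothetical split: classes 0 and 1 are "base", class 2 is "novel".
ious = per_class_iou(pred, gt, num_classes=3)
base_miou = group_miou(ious, [0, 1])
novel_miou = group_miou(ious, [2])
print(base_miou, novel_miou)
```

Reporting the two groups separately makes it visible when a model does well on trained categories but generalizes poorly to unseen ones.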
Abstract
Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos using acoustic cues. However, most approaches operate under the closed-set assumption and only identify pre-defined categories from the training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task, open-vocabulary audio-visual semantic segmentation, which extends the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen or heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module that performs audio-visual fusion and locates all potential sounding objects, and 2) an open-vocabulary classification module that predicts categories with the help of prior knowledge from large-scale pre-trained vision-language models. To properly evaluate open-vocabulary AVSS, we construct AVSBench-OV by splitting the AVSBench-semantic benchmark into zero-shot training and testing subsets. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.
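The open-vocabulary classification step described above can be sketched as matching each class-agnostic mask embedding against text embeddings of candidate category names. This is only a minimal illustration of similarity-based labeling in the style of vision-language models, assuming the embeddings are already computed; `classify_masks`, the temperature value, and the toy vectors are hypothetical and do not come from the OV-AVSS code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def classify_masks(mask_embeds, text_embeds, temperature=0.01):
    """Assign each mask the category whose text embedding is most similar.

    mask_embeds: (num_masks, dim) embeddings of class-agnostic masks.
    text_embeds: (num_classes, dim) embeddings of category names.
    The temperature is an assumed value, not one reported in the paper.
    """
    m = l2_normalize(mask_embeds)
    t = l2_normalize(text_embeds)
    logits = m @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Toy stand-ins for encoder outputs (hypothetical data).
categories = ["dog", "guitar", "helicopter"]
text_embeds = np.eye(3)
mask_embeds = np.array([[0.9, 0.1, 0.0],
                        [0.0, 0.2, 0.8]])

labels, probs = classify_masks(mask_embeds, text_embeds)
print([categories[i] for i in labels])
```

Because the category list is supplied only at inference time, names that never appeared in training can be added by embedding their text, which is what makes the label space open.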
