Audio-Visual Segmentation with Semantics
Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
TL;DR
This work defines audio-visual segmentation (AVS) and releases AVSBench, a pixel-level AVS benchmark with three settings: S4 (semi-supervised, single sound source), MS3 (fully supervised, multiple sound sources), and AVSS (fully supervised audio-visual semantic segmentation). It presents a baseline built on a temporal pixel-wise audio-visual interaction (TPAVI) module, which fuses audio features from the whole video with per-frame visual features, and a KL-based audio-visual mapping (AVM) regularizer that encourages audio-visual alignment. Experiments show that TPAVI and AVM together yield strong gains over baselines adapted from sound source localization (SSL), video object segmentation (VOS), and salient object detection (SOD), with further benefits from multi-stage fusion and from pretraining on the Single-source subset; the AVSS setting adds the challenge of semantic labeling. The dataset, online benchmark, and code aim to bridge audio and pixel-level visual semantics and spur progress in multi-modal segmentation research.
Abstract
We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, AVSBench, which provides pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully supervised audio-visual segmentation with multiple sound sources; and 3) fully supervised audio-visual semantic segmentation. The first two settings require generating binary masks of sounding objects, indicating the pixels that correspond to the audio, while the third setting further requires generating semantic maps that indicate the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on AVSBench compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench. The online benchmark is available at http://www.avlbench.opennlplab.cn.
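To make the regularization loss above more concrete, here is a minimal sketch of what a KL-based audio-visual mapping term could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names (`masked_avg_pool`, `avm_loss`), the softmax normalization, the direction of the KL divergence, and the use of plain Python lists instead of tensors are all choices made here for readability. The idea sketched is to pool the visual features under the predicted mask into one vector and penalize its divergence from the audio feature.

```python
import math

def softmax(values):
    """Turn a list of scores into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def masked_avg_pool(feature_map, mask, eps=1e-8):
    """Average an H x W x C feature map over the pixels selected by an H x W mask.

    Mask values are expected in [0, 1]; the result is a C-dimensional vector.
    """
    channels = len(feature_map[0][0])
    pooled = [0.0] * channels
    weight = 0.0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, m in zip(feat_row, mask_row):
            weight += m
            for c in range(channels):
                pooled[c] += m * feat[c]
    return [p / (weight + eps) for p in pooled]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def avm_loss(feature_map, mask, audio_feature):
    """Hypothetical regularizer: KL between masked visual and audio features.

    Assumption: both vectors are softmax-normalized before the comparison,
    and the audio feature already has the same dimension C as the visual one.
    """
    visual = softmax(masked_avg_pool(feature_map, mask))
    audio = softmax(audio_feature)
    return kl_divergence(visual, audio)

# Toy 2 x 2 frame with 2 channels; the mask selects the left column.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 0.0], [0.0, 1.0]]]
mask = [[1.0, 0.0],
        [1.0, 0.0]]

matched = avm_loss(features, mask, [1.0, 0.0])     # audio agrees with masked region
mismatched = avm_loss(features, mask, [0.0, 1.0])  # audio disagrees
print(matched < mismatched)
```

In this toy case the loss is close to zero when the audio feature matches the features under the mask and grows when it does not, which is the behavior a term meant to encourage audio-visual mapping would need. In practice such a term would be computed on learned feature tensors and added to the segmentation loss with a weighting factor.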
