Audio-Visual Segmentation with Semantics

Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong

TL;DR

This work defines audio-visual segmentation (AVS) and releases AVSBench, a pixel-level AVS benchmark with three settings: S4 (semi-supervised single-source), MS3 (fully supervised multi-sources), and AVSS (fully supervised semantic segmentation). It presents a TPAVI-based baseline that fuses whole-video audio with per-frame visual features via temporal pixel-wise interactions and a KL-based AVM regularizer to enforce audio–visual alignment. Experiments show TPAVI and AVM together yield strong gains over SSL/VOS/SOD baselines, with benefits from multi-stage fusion and pretraining on the Single-source subset; AVSS demonstrates the additional challenge of semantic labeling. The dataset and online benchmark, along with code, aim to bridge audio and pixel-level visual semantics and spur progress in multi-modal segmentation research.

Abstract

We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources, and 3) fully-supervised audio-visual semantic segmentation. The first two settings need to generate binary masks of sounding objects indicating pixels corresponding to the audio, while the third setting further requires generating semantic maps indicating the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on AVSBench compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench. Online benchmark is available at http://www.avlbench.opennlplab.cn.
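
The regularization loss mentioned above (described in the TL;DR as KL-based) can be illustrated with a small sketch. This is not the authors' implementation: the masked average pooling, the softmax normalization, the direction of the KL divergence, and the names `avm_loss_sketch`, `feature_map`, `mask`, and `audio_feature` are all assumptions made for illustration. The idea is only that visual features inside the predicted mask should have a distribution close to the audio feature.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def avm_loss_sketch(feature_map, mask, audio_feature, eps=1e-8):
    """Hypothetical KL-based audio-visual mapping loss for one frame.

    feature_map:   (H, W, C) visual features
    mask:          (H, W) predicted soft mask with values in [0, 1]
    audio_feature: (C,) audio feature, assumed to share the channel size C
    """
    weights = mask[..., None]  # (H, W, 1)
    # Average the visual features over the masked region (assumption).
    pooled = (feature_map * weights).sum(axis=(0, 1)) / (weights.sum() + eps)
    p = softmax(pooled)         # distribution from masked visual features
    q = softmax(audio_feature)  # distribution from audio features
    # KL(p || q); the direction is an assumption of this sketch.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

When the pooled visual feature matches the audio feature the loss is close to zero, and it grows as the two distributions diverge, which is the behavior a mapping regularizer of this kind relies on.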

Paper Structure

This paper contains 14 sections, 2 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Comparison of the proposed AVS task with the Sound source localization (SSL) task. SSL aims to estimate an approximate location of the sounding objects in the visual frame, at a patch level. In contrast, AVS estimates pixel-wise masks for all the sounding objects, regardless of the number of visible sounding objects. The segmentation masks can be binary or semantic under different task settings. The binary masks indicate objects making sounds while the semantic masks further distinguish the object category. In the last row, the ground truths are displayed with the semantic masks.
  • Figure 2: Statistics of the AVSBench dataset extension, i.e., the AVSBench-semantic dataset. The extension contains 70 categories, and the number of videos in each category is given.
  • Figure 3: AVSBench samples. The AVSBench dataset contains the Single-source subset (a), the Multi-sources subset (b), and the Semantic-labels subset (c), which mainly contains multi-source videos. Each video is divided into 5 clips in the first two subsets and 10 clips in the third, as shown. Annotated clips are indicated by brown framing rectangles, while green rectangles mark frames with no sounding objects; the names of sounding objects are shown in red text. The first two subsets are annotated with binary masks of the sounding objects, shown as the orange masks in (a) and (b). The third subset provides color-coded semantic masks indicating different object categories. Note that for the Single-source training set of AVSBench, only the first frame of each video is annotated, whereas all of the extracted frames are annotated for all other sets.
  • Figure 4: Overview of the Baseline, which follows a hierarchical Encoder-Decoder pipeline. The encoder takes the video frames and the entire audio clip as inputs, and outputs visual and audio features, respectively denoted as $\bm{F}_i$ and $\bm{A}$. The visual feature map $\bm{F}_i$ at each stage is further sent to the ASPP module (chen2017deeplab) and then our TPAVI module (introduced in Sec. \ref{sec:approach}). ASPP provides different receptive fields for recognizing visual objects, while TPAVI focuses on the temporal pixel-wise audio-visual interaction. The decoder progressively enlarges the fused feature maps by four stages and finally generates the output mask $\bm{M}$ for sounding objects.
  • Figure 5: The TPAVI module takes the $i$-th stage visual feature $\bm{V}_i$ and the audio feature $\bm{A}$ as inputs. The colored boxes represent $1 \times 1 \times 1$ convolutions, while the yellow boxes indicate reshaping operations. The symbols "$\otimes$" and "$\oplus$" denote matrix multiplication and element-wise addition, respectively.
  • ...and 9 more figures
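
The operations named in the Figure 5 caption ($1 \times 1 \times 1$ convolutions, reshaping, matrix multiplication, and element-wise addition) can be sketched in NumPy as below. This is a minimal illustration in the style of a non-local block, not the authors' code. It assumes the audio feature has already been projected to the visual channel size, that the attention map is normalized by the number of positions, and that the weights `w_theta`, `w_phi`, `w_g`, and `w_mu` are plain matrices; all of these names are hypothetical.

```python
import numpy as np


def conv1x1x1(x, weight):
    """A 1x1x1 convolution is a linear map applied at every (t, h, w) position."""
    return x @ weight


def tpavi_sketch(visual, audio, w_theta, w_phi, w_g, w_mu):
    """Hypothetical temporal pixel-wise audio-visual interaction.

    visual: (T, H, W, C) visual features of one stage for T frames
    audio:  (T, C) audio features, assumed to share the channel size C
    w_*:    (C, C) weights of the 1x1x1 convolutions
    """
    T, H, W, C = visual.shape
    n = T * H * W
    # Repeat each frame's audio feature over all of its pixels.
    audio_map = np.broadcast_to(audio[:, None, None, :], (T, H, W, C))
    # Project, then reshape so every space-time position is one row.
    theta = conv1x1x1(visual, w_theta).reshape(n, C)
    phi = conv1x1x1(audio_map, w_phi).reshape(n, C)
    g = conv1x1x1(visual, w_g).reshape(n, C)
    # Matrix multiplication: similarity between every visual position and
    # every audio position across the whole video, scaled by n (assumption).
    attention = theta @ phi.T / n              # (n, n)
    fused = (attention @ g).reshape(T, H, W, C)
    # Element-wise addition: residual connection back to the visual input.
    return visual + conv1x1x1(fused, w_mu)
```

Because the rows span all `T * H * W` positions, each pixel can attend to audio from every frame of the clip, which is one way to read "temporal pixel-wise" interaction; the residual addition keeps the output the same shape as the input so it can feed the decoder unchanged.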