Audio-Visual Instance Segmentation

Ruohao Guo; Xianghua Ying; Yaru Chen; Dantong Niu; Guangyao Li; Liao Qu; Yanyu Qi; Jinxing Zhou; Bowei Xing; Wenzhen Yue; Ji Shi; Qixun Wang; Peiliang Zhang; Buwen Liang

TL;DR

This work defines audio-visual instance segmentation (AVIS) and introduces AVISeg, a long-form video benchmark with 26 categories and over 90K instance masks across 926 videos. It presents AVISM, a baseline that localizes sound sources within each frame and tracks sounding objects across the video, combining a frame-level sound source localizer with a window-based video-level tracker and using a compact token-based representation to handle long sequences. AVISM achieves the best results on AVISeg in FSLA, HOTA, and mAP, while the evaluation shows that current multi-modal large models struggle with instance-level sound source localization and temporal perception. The dataset and baseline establish a foundation for fine-grained, temporally aware multi-modal understanding in realistic audio-visual video settings.

Abstract

In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment, and track individual sounding object instances in audible videos. To facilitate this research, we introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos. Additionally, we propose a strong baseline model for this task. Our model first localizes sound sources within each frame and condenses object-specific contexts into concise tokens. It then builds long-range audio-visual dependencies between these tokens using window-based attention and tracks sounding objects across the entire video sequence. Extensive experiments reveal that our method performs best on AVISeg, surpassing existing methods from related tasks. We further evaluate several multi-modal large models. Unfortunately, they exhibit subpar performance on instance-level sound source localization and temporal perception. We expect that AVIS will inspire the community towards a more comprehensive multi-modal understanding. The dataset and code are available at https://github.com/ruohaoguo/avis.

Paper Structure (33 sections, 4 equations, 10 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Video Instance Segmentation
Audio-Visual Segmentation
New Task
Problem Definition
Evaluation Metrics
Dataset
Baseline Model
Audio-Visual Representation
Frame-Level Sound Source Localizer
Video-Level Sounding Object Tracker
Training Loss
Experiment
Main Results
...and 18 more sections

Figures (10)

Figure 1: Comparison of different audio-visual segmentation tasks. (a) Audio-Visual Object Segmentation (AVOS) only requires binary segmentation. (b) Audio-Visual Semantic Segmentation (AVSS) associates one category with every pixel. (c) Audio-Visual Instance Segmentation (AVIS) treats each sounding object of the same class as an individual instance.
Figure 2: Illustrations of our AVISeg dataset statistics. (a) Ratio of different sound sources. (b) Number of videos in 4 real-world scenarios. (c) Distribution of video lengths. (d) Number of videos and objects for the 26 categories. (e) Relations between different categories.
Figure 3: Overview of the proposed AVISM for audio-visual instance segmentation. (a) The frame-level sound source localizer segments sounding objects within each frame independently and condenses dense image features into frame queries. (b) The video-level sounding object tracker takes frame queries and audio features as input, and then performs temporal audio-visual communications between frames.
Figure 4: The architecture of our proposed video-level audio-visual fusion module. For the entire video sequence, it computes cross-attention between object tokens $\{\hat{Q}_{o,i}\}_{i=1}^T$ and audio features $\{f_i^A\}_{i=1}^T$ within local windows, and introduces cross-window connections by shifting windows.
Figure 5: Sample results of our baseline model on the AVISeg dataset from four scenarios: (a) Music; (b) Speaking; (c) Machine; (d) Animal. Each row has six sampled frames from a video sequence. Zoom in to see details.
...and 5 more figures
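
The windowed cross-attention described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one object token per frame (the model keeps several), plain dot-product attention with no learned projections, a residual connection, and truncated rather than cyclic windows at the sequence ends. The names `window_partition` and `windowed_cross_attention` are hypothetical.

```python
import math


def window_partition(num_frames, window, shift=0):
    """Split frame indices into local windows.

    With shift > 0 the window boundaries move, so frames that sat in
    different windows before now share one (cross-window connection).
    Assumes 0 <= shift < window; edge windows are simply truncated.
    """
    bounds = sorted({0, num_frames} | set(range(shift, num_frames, window)))
    return [list(range(a, b)) for a, b in zip(bounds, bounds[1:])]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_cross_attention(tokens, audio, window, shift=0):
    """Each frame's token attends only to audio features in its window.

    tokens[i] and audio[i] are equal-length feature vectors for frame i.
    """
    num_frames, dim = len(tokens), len(tokens[0])
    out = [None] * num_frames
    for group in window_partition(num_frames, window, shift):
        for i in group:
            # Scaled dot-product scores between token i and each audio
            # feature inside the same window.
            scores = [
                sum(q * k for q, k in zip(tokens[i], audio[j])) / math.sqrt(dim)
                for j in group
            ]
            weights = softmax(scores)
            attended = [
                sum(w * audio[j][c] for w, j in zip(weights, group))
                for c in range(dim)
            ]
            # Residual connection (an assumption of this sketch).
            out[i] = [t + a for t, a in zip(tokens[i], attended)]
    return out
```

Alternating a pass with `shift=0` and a pass with `shift=window // 2` lets information cross the original window boundaries, while the cost of each pass grows with the window size rather than with the full video length.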
