Audio-Visual Instance Segmentation
Ruohao Guo, Xianghua Ying, Yaru Chen, Dantong Niu, Guangyao Li, Liao Qu, Yanyu Qi, Jinxing Zhou, Bowei Xing, Wenzhen Yue, Ji Shi, Qixun Wang, Peiliang Zhang, Buwen Liang
TL;DR
This work defines audio-visual instance segmentation (AVIS) and introduces AVISeg, a long-form video benchmark with 26 categories and 94k instance masks across 926 videos. It presents AVISM, a baseline that localizes sound sources in each frame and tracks sounding objects via frame-level audio-visual fusion and a window-based video-level tracker, using a compact token-based representation to handle long sequences. AVISM achieves state-of-the-art results on AVISeg across FSLA, HOTA, and mAP, while the evaluation reveals that current multi-modal large models struggle with precise instance-level localization and temporal grounding. The dataset and baseline establish a foundation for advancing fine-grained, temporally aware multi-modal understanding in realistic audio-visual settings.
Abstract
In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment, and track individual sounding object instances in audible videos. To facilitate this research, we introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos. Additionally, we propose a strong baseline model for this task. Our model first localizes the sound sources within each frame and condenses object-specific contexts into concise tokens. It then builds long-range audio-visual dependencies between these tokens using window-based attention and tracks sounding objects across the entire video sequence. Extensive experiments show that our method performs best on AVISeg, surpassing existing methods from related tasks. We further evaluate several multi-modal large models. Unfortunately, they exhibit subpar performance on instance-level sound source localization and temporal perception. We expect that AVIS will inspire the community towards a more comprehensive multi-modal understanding. The dataset and code are available at https://github.com/ruohaoguo/avis.
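
To make the idea of window-based attention over condensed tokens concrete, the following is a minimal sketch and not the model's actual implementation. It assumes one token vector per frame, non-overlapping temporal windows, and plain scaled dot-product self-attention without learned projections; the function names `softmax` and `window_attention` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(tokens, window):
    """Self-attention restricted to non-overlapping temporal windows.

    tokens: list of T vectors (lists of floats), one per frame (assumption).
    window: number of consecutive frames that may attend to each other.
    Returns a list of T output vectors of the same dimension.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]  # last chunk may be shorter
        for q in chunk:
            scale = math.sqrt(len(q))
            # Similarity of this token to every token in the same window.
            scores = [sum(a * b for a, b in zip(q, k)) / scale for k in chunk]
            weights = softmax(scores)
            # Weighted average of the window's tokens (values = tokens here).
            mixed = [
                sum(w * v[d] for w, v in zip(weights, chunk))
                for d in range(len(q))
            ]
            out.append(mixed)
    return out


if __name__ == "__main__":
    frame_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(window_attention(frame_tokens, window=2))
```

Restricting attention to windows of w frames reduces the number of pairwise scores from T^2 for full attention to roughly T*w, which is one reason a windowed design suits long videos.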
