SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation

Yingjian Zhu, Ying Wang, Yuyang Hong, Ruohao Guo, Kun Ding, Xin Gu, Bin Fan, Shiming Xiang

TL;DR

SeaVIS is the first online framework for audio-visual instance segmentation. It uses an Audio-Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode both visual appearance and sounding activity, which significantly improves its audio-following capability.

Abstract

Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment and track individual sounding instances in videos. However, prevailing methods primarily adopt the offline paradigm, which cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module, which integrates visual features from the current frame with the entire audio history under strict causal constraints, to enable efficient online processing. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object's sounding and silent states, resulting in the incorrect segmentation of silent objects. To tackle this, we employ an Audio-Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode not only visual appearance but also sounding activity. In this way, instances that are preserved during per-frame prediction but do not emit sound can be effectively suppressed during the instance association process, thereby significantly enhancing the audio-following capability of SeaVIS. Extensive experiments conducted on the AVISeg dataset demonstrate that SeaVIS surpasses existing state-of-the-art models across multiple evaluation metrics while maintaining a competitive inference speed suitable for real-time processing.
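
To make the causal constraint in CCAF concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one visual feature vector and one audio feature vector per frame, plain scaled dot-product attention with no learned projections, and a residual addition of the attended audio to the visual feature. The function names `softmax` and `causal_cross_attention` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_cross_attention(visual, audio):
    """Fuse per-frame visual queries with the audio history.

    visual: list of T vectors (one query per frame).
    audio:  list of T vectors (one key/value per frame).
    Frame t attends only to audio steps 0..t, so no future audio is used.
    Returns (fused, weights); weights[t] is padded with zeros for future steps.
    """
    dim = len(visual[0])
    scale = 1.0 / math.sqrt(dim)
    fused, all_weights = [], []
    for t, query in enumerate(visual):
        history = audio[: t + 1]  # causal constraint: current and past audio only
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in history]
        weights = softmax(scores)
        attended = [
            sum(w * key[d] for w, key in zip(weights, history)) for d in range(dim)
        ]
        # Assumption: audio-enhanced visual feature = visual + attended audio.
        fused.append([v + a for v, a in zip(query, attended)])
        all_weights.append(weights + [0.0] * (len(audio) - t - 1))
    return fused, all_weights


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    audio = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
    fused, weights = causal_cross_attention(visual, audio)
    for t, row in enumerate(weights):
        print(t, [round(w, 3) for w in row])
```

Because frame t never reads audio beyond step t, the output for earlier frames is unchanged when new frames arrive, which is the property that lets a model of this kind run on a continuous stream.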

Paper Structure (18 sections, 9 equations, 16 figures, 11 tables)

Figures (16)

  • Figure 1: Two key limitations of offline models. (A) Different inference paradigms between offline and online models. (A.1) Offline models process the entire sequence of video frames simultaneously and generate only video-level predictions. Consequently, the segmentation of each frame depends heavily on information from future frames. (A.2) In contrast, online models process input frames in a streaming, frame-by-frame manner, followed by post-processing to associate instances across frames. (B) Offline models face the continual inference gap problem: after completing inference on a fixed-length video segment, the model cannot incrementally process new frames. Consequently, newly predicted segments cannot be associated with previously identified instances (e.g., $p_1$ and $d_1$), leading to association failure such as $p_1 \ne p_3$ and $d_1 \ne d_2$.
  • Figure 2: Overview of the proposed SeaVIS. SeaVIS operates through two sequential stages: per-frame instance segmentation prediction followed by cross-frame instance association. We propose two key components in our framework: (a) A causal cross attention fusion module enabling cross-frame integration between visual and audio modalities under the temporal constraints of online processing; (b) Dual-level audio-guided contrastive learning at both frame and instance levels to optimize sound-aware instance embedding for audio-visual instance segmentation.
  • Figure 3: Illustration of Causal Cross Attention Fusion (CCAF). This module receives both visual and audio signals and outputs visual features enhanced by the audio features. At each frame, the visual features integrate audio features from all previous frames and the current frame using cross-attention.
  • Figure 4: Illustration of the Audio-Guided Contrastive Learning (AGCL) strategy. (a) A tracking example containing the essential components of the proposed strategy. At each frame, an anchor is obtained from the corresponding audio clip. The detection results at each time step include sounding instances, non-sounding instances, and background. (b) Frame-level contrastive learning: At each frame, the audio anchor attracts sounding instances while repelling both non-sounding instances and background. (c) Instance-level contrastive learning: For each instance tracked across time, an average audio anchor is computed from frames where the instance is sounding. This anchor pulls the sounding embeddings of that instance while repelling its non-sounding embeddings. (d) The learned sound-aware embeddings assist in filtering out silent instances during the association process.
  • Figure 5: Visualization of attention scores from the CCAF module during inference. The red boxes highlight that, for a given frame, the module assigns high attention scores to critical preceding audio time steps.
  • ...and 11 more figures
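
The attract/repel behaviour described in the Figure 4 caption can be sketched as an InfoNCE-style loss. This is a hedged illustration and not the paper's loss: the exact formulation does not appear in this excerpt, so the cosine similarity, the `temperature` value, and the function names below are all assumptions. Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def frame_level_contrastive_loss(anchor, sounding, silent, temperature=0.1):
    """Pull sounding embeddings toward the audio anchor, push the rest away.

    anchor:   audio embedding for the current frame.
    sounding: embeddings of sounding instances (positives).
    silent:   embeddings of non-sounding instances and background (negatives).
    """
    if not sounding:
        return 0.0  # no positives in this frame, nothing to attract
    negative_sum = sum(math.exp(cosine(anchor, n) / temperature) for n in silent)
    losses = []
    for positive in sounding:
        pos = math.exp(cosine(anchor, positive) / temperature)
        losses.append(-math.log(pos / (pos + negative_sum)))
    return sum(losses) / len(losses)


def instance_level_anchor(audio_anchors, is_sounding):
    """Average the per-frame audio anchors over frames where the instance sounds."""
    chosen = [a for a, flag in zip(audio_anchors, is_sounding) if flag]
    if not chosen:
        return None
    dim = len(chosen[0])
    return [sum(a[d] for a in chosen) / len(chosen) for d in range(dim)]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    aligned = frame_level_contrastive_loss(anchor, [[1.0, 0.0]], [[-1.0, 0.0], [0.0, 1.0]])
    misaligned = frame_level_contrastive_loss(anchor, [[-1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
    print(aligned < misaligned)
```

Under these assumptions, the instance-level variant would reuse `frame_level_contrastive_loss` with the averaged anchor from `instance_level_anchor`, taking the instance's sounding-frame embeddings as positives and its silent-frame embeddings as negatives.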