Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

Tianxiang Chen; Zhentao Tan; Tao Gong; Qi Chu; Yue Wu; Bin Liu; Le Lu; Jieping Ye; Nenghai Yu

TL;DR

This work tackles modality imbalance in audio-visual segmentation by proposing AVSAC, which strengthens audio cues through a Bidirectional Audio-Visual Decoder (BAVD) with bidirectional bridges and introduces an Audio-Visual Frame-wise Synchrony (AVFS) loss for fine-grained audio-visual alignment. The dual-tower decoder enables continuous, mutual influence between audio-guided visual and vision-guided audio representations, improving joint audio-visual learning. AVSAC achieves state-of-the-art results on AVSBench sub-tasks (S4, MS3, AVSS), and ablations confirm the effectiveness of bidirectional bridges and frame-wise synchrony in balancing modalities. The approach advances practical AVS performance and provides a plug-in AVFS loss that can bolster audio-driven guidance in multi-modal segmentation settings.

Abstract

How to make audio and vision interact effectively has attracted considerable interest in multi-modal research. The recently proposed audio-visual segmentation (AVS) task aims to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods suffer from a modality imbalance in which visual features tend to dominate audio features, because audio cues are integrated unidirectionally and insufficiently. This imbalance skews the feature representation towards the visual modality, impeding the learning of joint audio-visual representations and potentially causing segmentation inaccuracies. To address this issue, we propose AVSAC. Our approach features a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, which strengthens audio cues and fosters continuous interplay between the audio and visual modalities. This bidirectional interaction reduces the modality imbalance, facilitating more effective learning of integrated audio-visual representations. Additionally, we present an audio-visual frame-wise synchrony strategy that provides fine-grained guidance for BAVD. This strategy increases the share of auditory components in visual features, contributing to more balanced audio-visual representation learning. Extensive experiments show that our method achieves new state-of-the-art AVS performance.
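The bidirectional interaction described in the abstract can be illustrated with a minimal sketch. It assumes that each "bridge" is a single-head dot-product cross-attention with a residual connection, that the audio-guided branch uses visual tokens as queries, and that the vision-guided branch uses audio tokens as queries. The function names `cross_attend` and `bidirectional_decoder`, the layer count, and the absence of learned projections are all hypothetical simplifications, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head dot-product cross-attention plus a residual connection.

    Assumption: no learned projections; keys and values are the same vectors.
    Both arguments are lists of d-dimensional vectors (lists of floats).
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended = [sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)]
        # Residual: the query keeps its own content and adds the other modality.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def bidirectional_decoder(audio, visual, num_layers=3):
    """Two parallel branches that exchange information at every layer.

    Hypothetical structure: at each layer both branches read the other
    branch's state from the previous layer, so influence flows both ways.
    """
    for _ in range(num_layers):
        new_visual = cross_attend(visual, audio)  # audio-guided vision
        new_audio = cross_attend(audio, visual)   # vision-guided audio
        audio, visual = new_audio, new_visual
    return audio, visual


audio_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
audio_out, visual_out = bidirectional_decoder(audio_tokens, visual_tokens)
```

In this sketch the audio tokens are updated at every layer instead of being consumed once, which is the property the abstract contrasts with unidirectional fusion.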

Paper Structure (21 sections, 5 equations, 5 figures, 6 tables)

This paper contains 21 sections, 5 equations, 5 figures, 6 tables.

Introduction
Related Work
Audio-Visual Segmentation
Vision Transformer
Method
Overview
Bidirectional Audio-Visual Decoder (BAVD)
Audio-Guided Vision (AGV) Decoder Branch
Vision-Guided Audio (VGA) Decoder Branch
Audio-Visual Frame-wise Synchrony (AVFS)
Loss Function
Experiments
Datasets
Evaluation Metrics
Implementation Details
...and 6 more sections

Figures (5)

Figure 1: (a) Modality imbalance in the AVS task, where visual features tend to overshadow audio cues and degrade the AVS result. We measure the degree of audio-visual imbalance by the proportions of audio and visual feature components in the final feature, and analyze it together with the visual results. The imbalance arises because (b) the audio-visual fusion in existing methods [zhou2022audio, gao2023avsegformer] is unidirectional and insufficient, with only audio cues as queries. (c) Our AVSAC relieves modality imbalance with a parallel decoder structure linked by multiple bidirectional bridges, which strengthen audio cues through bidirectional and continuous audio-visual interaction. We also introduce audio-visual frame-wise synchrony to foster the integration of audio components into visual features, further alleviating modality imbalance.
Figure 2: Overall architecture of AVSAC. The framework has two key components: (1) the Bidirectional Audio-Visual Decoder (BAVD), which strengthens audio cues by consistently balancing AGV with VGA through continuous and in-depth bilateral modality interaction; (2) the Audio-Visual Frame-wise Synchrony (AVFS) module, which provides finer-grained (frame-wise) guidance to BAVD. It helps inject auditory components into visual features to increase the importance of audio cues.
Figure 3: Illustration of the internal module structure of the Bidirectional Audio-Visual Decoder (BAVD).
Figure 4: Qualitative examples of AVSBench, AVSegFormer, and our AVSAC framework. The left part shows video frame outputs for the S4 setting, the middle part for the MS3 setting, and the right part for the AVSS setting. The other two methods produce less precise segmentation maps, whereas our AVSAC not only avoids some false alarms but also delineates the shapes of sounding objects more accurately.
Figure 5: Visual comparison of AVS results over consecutive frames with and without multiple bidirectional bridges. The red areas in attention heat maps highlight the sounding objects.
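
The frame-wise synchrony guidance mentioned in Figures 1 and 2 can be pictured with a simple per-frame alignment penalty. This summary does not give the AVFS loss definition, so the sketch below uses an assumed stand-in: one minus the cosine similarity between an audio embedding and a visual embedding for each frame, averaged over frames. The name `framewise_sync_penalty` and the use of cosine similarity are hypothetical, and the paper's actual loss may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def framewise_sync_penalty(audio_frames, visual_frames):
    """Mean of (1 - cosine similarity) over temporally matched frames.

    Assumed stand-in for a frame-wise synchrony loss, not the paper's formula.
    audio_frames[t] and visual_frames[t] are embeddings of the same frame t.
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual sequences must have the same number of frames")
    if not audio_frames:
        raise ValueError("at least one frame is required")
    per_frame = [1.0 - cosine(a, v) for a, v in zip(audio_frames, visual_frames)]
    return sum(per_frame) / len(per_frame)


aligned = framewise_sync_penalty([[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [0.0, 1.0]])
misaligned = framewise_sync_penalty([[1.0, 0.0]], [[0.0, 1.0]])
```

Because the penalty is computed frame by frame, a visual feature that ignores the audio at a single time step is penalized even if the clip-level averages agree, which is the sense in which such guidance is finer-grained than clip-level supervision.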
