Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, Shiming Xiang

TL;DR

This work tackles audio-visual segmentation by introducing COMBO, a transformer-based framework that jointly models pixel-level, modality-level, and temporal relationships. It introduces three bilateral entanglements: pixel entanglement via a Siam-Encoder Module (SEM) that leverages Maskige priors from a frozen foundation model, modality entanglement via a Bilateral-Fusion Module (BFM) for bidirectional audio-visual fusion, and temporal entanglement via an adaptive inter-frame consistency loss. The approach achieves state-of-the-art results on AVSBench-object and AVSBench-semantic datasets, with ablations validating the contributions of SEM, BFM, and $\mathcal{L}_{ada}$. The Maskige-based pixel conditioning and memory-efficient cross-modal fusion offer a practical pathway to robust, pixel-precise AVS in real-world video data, and the framework provides a blueprint for integrating foundation-model priors with multimodal transformers.

Abstract

Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge from the foundation model to generate more precise visual features. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss based on the inherent regularities of temporal data. Comprehensive experiments and ablation studies on AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Code and more results will be publicly available at https://yannqi.github.io/AVS-COMBO/.
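
To make the idea of bi-directional alignment concrete, the following is a minimal sketch of two-way cross-attention between visual tokens and audio tokens. It is an illustration under stated assumptions, not the BFM as implemented in the paper: the function names (`cross_attend`, `bilateral_fusion`) are hypothetical, there are no learned query/key/value projections, and the attention is plain scaled dot-product attention with a residual (skip) connection.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token attends over the context tokens.

    Assumption: the context rows serve as both keys and values, and no
    learned projections are applied (a real module would learn them).
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        attended = [
            sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)
        ]
        # Residual (skip) connection: keep the original token, add the context.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def bilateral_fusion(visual, audio):
    """Hypothetical two-way fusion: visual attends to audio and vice versa."""
    fused_visual = cross_attend(visual, audio)
    fused_audio = cross_attend(audio, visual)
    return fused_visual, fused_audio


# Toy input: three visual (pixel) tokens and one audio token, feature size 2.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5]]
fused_visual, fused_audio = bilateral_fusion(visual, audio)
```

Both directions reuse the same attention routine with the roles of query and context swapped, which is what "bilateral" refers to in this sketch. With a single audio token, every visual token receives that token unchanged, so the toy input mainly shows the shapes: the fused visual output keeps one row per visual token, and the fused audio output keeps one row per audio token.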

Paper Structure (25 sections, 8 equations, 10 figures, 10 tables)

This paper contains 25 sections, 8 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Comparison between the proposed COMBO and existing state-of-the-art methods. COMBO is the first work to simultaneously explore multi-order bilateral relations at the modality, temporal, and pixel levels.
  • Figure 2: Overview of the proposed COMBO. COMBO adopts a novel audio-visual transformer framework designed specifically for audio-visual segmentation. Aiming at multi-order bilateral entanglement, our method is composed of three independent modules. (1) We introduce the Siam-Encoder Module, which is designed for the exploration of pixel entanglement. (2) To integrate the entanglement of audio and visual signals, we propose a Bilateral-Fusion Module. (3) Given the inherent characteristics of temporal entanglement, we construct an adaptive inter-frame consistency loss in the segmentation module to enhance the consistency of the output.
  • Figure 3: Illustration of the Bilateral-Fusion Module (BFM). We input both visual and audio signals, which are subsequently processed through bilateral attention to yield the fused visual and audio features, respectively. We omit the subscripts of $H$ and $W$ for readability. The dashed line indicates a skip connection. Best viewed in color.
  • Figure 4: Illustration of the impact of the Adaptive Inter-frame Consistency Loss. We visualize heat maps of the predicted masks without and with $\mathcal{L}_{ada}$ on the S4 subset. The results indicate that applying $\mathcal{L}_{ada}$ improves inter-frame consistency. Best viewed in color.
  • Figure 5: Comparison of visual examples on the AVSBench-object and AVSBench-semantic datasets with AVSBench [2023AVSS] and AVSegformer [2023avsegformer]. The leftmost example is from the S4 subset, the middle example is from the MS3 subset, and the rightmost example is from the AVSS subset. Red bounding boxes highlight the regions being compared.
  • ...and 5 more figures