Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics

Chen Liu, Liying Yang, Peike Li, Dadong Wang, Lincheng Li, Xin Yu

TL;DR

This work tackles audio-driven audio-visual segmentation (AVS), where overlapping sounds and large intra-class variation impede robust cross-modal alignment. It introduces DDESeg, a framework with a Dynamic Derivation Module that constructs a semantic memory from single-source audio and derives multiple distinct audio representations from mixed signals, and a Dynamic Elimination Module that filters out non-matching audio cues using visual guidance. The system employs hierarchical cross-modal fusion to integrate refined audio semantics with visual features, and is optimized with a loss that combines Dice, binary cross-entropy (BCE), and IoU terms. Empirical results on the AVS-Object, AVS-Semantic, and VPO benchmarks show state-of-the-art performance and clear gains in multi-source scenarios, demonstrating improved audio-visual alignment and sound attribution. The approach offers practical benefits for robust multimodal perception in complex auditory environments.
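To make the combined objective concrete, the following is a minimal sketch of a loss that sums Dice, BCE, and IoU terms over a flattened binary mask. The equal weights, the smoothing constants, and all function names are assumptions made for illustration; the summary above does not state how the paper weights or implements the three terms.

```python
import math

def _intersection(probs, targets):
    # Soft intersection between predicted probabilities and binary targets.
    return sum(p * t for p, t in zip(probs, targets))

def dice_loss(probs, targets, eps=1e-6):
    # 1 - Dice coefficient; eps (assumed) avoids division by zero.
    inter = _intersection(probs, targets)
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def bce_loss(probs, targets, eps=1e-7):
    # Mean binary cross-entropy; probabilities are clipped so log() stays finite.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    terms = [t * math.log(p) + (1 - t) * math.log(1 - p)
             for p, t in zip(clipped, targets)]
    return -sum(terms) / len(terms)

def iou_loss(probs, targets, eps=1e-6):
    # 1 - soft IoU (intersection over union).
    inter = _intersection(probs, targets)
    union = sum(probs) + sum(targets) - inter
    return 1.0 - (inter + eps) / (union + eps)

def combined_loss(probs, targets, w_dice=1.0, w_bce=1.0, w_iou=1.0):
    # Weighted sum of the three terms; equal weights are an assumption.
    return (w_dice * dice_loss(probs, targets)
            + w_bce * bce_loss(probs, targets)
            + w_iou * iou_loss(probs, targets))

if __name__ == "__main__":
    target = [1, 0, 1, 0]
    print(combined_loss([0.9, 0.1, 0.8, 0.2], target))  # close prediction
    print(combined_loss([0.1, 0.9, 0.2, 0.8], target))  # poor prediction
```

The region-based terms (Dice, IoU) reward overlap with the ground-truth mask as a whole, while BCE penalizes each pixel independently, which is one common reason for combining them in segmentation training.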

Abstract

Sound-guided object segmentation has drawn considerable attention for its potential to enhance multimodal perception. Previous methods primarily focus on developing advanced architectures to facilitate effective audio-visual interactions, without fully addressing the inherent challenges posed by the nature of audio, i.e., (1) feature confusion due to the overlapping nature of audio signals, and (2) audio-visual matching difficulty arising from the varied sounds produced by the same object. To address these challenges, we propose Dynamic Derivation and Elimination (DDESeg), a novel audio-visual segmentation framework. Specifically, to mitigate feature confusion, DDESeg reconstructs the semantic content of the mixed audio signal by enriching the distinct semantic information of each individual source, deriving representations that preserve the unique characteristics of each sound. To reduce the matching difficulty, we introduce a discriminative feature learning module, which enhances the semantic distinctiveness of the generated audio representations. Considering that not all derived audio representations directly correspond to visual features (e.g., off-screen sounds), we propose a dynamic elimination module to filter out non-matching elements. This module facilitates targeted interaction between sounding regions and relevant audio semantics. By scoring the interacted features, we identify and filter out irrelevant audio information, ensuring accurate audio-visual alignment. Comprehensive experiments demonstrate that our framework achieves superior performance on AVS datasets.
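The elimination step described above (score each derived audio representation against the visual content, then drop the ones that do not match) can be illustrated with a small sketch. This is not the paper's module: cosine similarity stands in for the learned scoring, and the threshold value and the names `cosine` and `eliminate` are hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def eliminate(audio_reps, visual_feats, threshold=0.5):
    """Keep audio representations whose best match among the visual
    features scores at least `threshold` (an assumed hyperparameter).

    Returns a list of (index, score) pairs for the retained representations.
    """
    kept = []
    for idx, rep in enumerate(audio_reps):
        # Best similarity to any visual region feature; 0.0 if there are none.
        score = max((cosine(rep, v) for v in visual_feats), default=0.0)
        if score >= threshold:
            kept.append((idx, score))
    return kept

if __name__ == "__main__":
    # Two derived audio representations; only the first aligns with the visible region.
    audio_reps = [[1.0, 0.0], [0.0, 1.0]]
    visual_feats = [[1.0, 0.1]]
    print(eliminate(audio_reps, visual_feats))
```

In this toy setting the second representation plays the role of an off-screen sound: it has no well-matching visual feature, so it is removed before any fusion with the visual branch.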

Paper Structure

This paper contains 12 sections, 18 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The illustration of feature confusion and audio-visual matching difficulty. (a) Feature confusion denotes the challenge of distinguishing or separating individual sound sources in a mixed audio signal, especially with overlapping frequency, timbre, or spatial cues, which impedes accurate semantic extraction. (b) Significant amplitude and frequency variations within sounds produced by the same object lead to large intra-class variation, introducing audio-visual matching difficulty. This complicates the model's ability to align audio and visual modalities.
  • Figure 2: Three methods for achieving precise audio-visual alignment: (a) Audio Semantic Decomposition [li2024qdformer]: models the multi-source semantic space as a Cartesian product of single-source subspaces, employing product quantization and a shared codebook to decompose audio features into compact semantic tokens. (b) Audio Separation [chen2024cpm]: devises a branch to decode audio-visual fused features into separated audio signals. (c) Audio Semantic Derivation and Elimination (Ours): derives distinct semantic representations for each source from a mixed audio signal by exploring inter-class relationships. Furthermore, the derived semantic representations are refined through intra-class relationships, while irrelevant audio representations are excluded under visual guidance.
  • Figure 3: Overview of the DDESeg architecture and its key components: (a) Framework Architecture. A dual-branch framework that hierarchically processes and aligns audio-visual features. Through progressive multi-stage alignment, the framework fuses cross-modal information and generates precise pixel-wise segmentation maps via the segmentation head. (b) Dynamic Derivation Module. This module generates multiple audio representations from the input audio feature and explores intra-class relationships to equip each derived representation with discriminative features. (c) Dynamic Elimination Module. This module eliminates audio representations that do not correspond to visual regions by evaluating the relevance between audio representations and learned image semantic representations.
  • Figure 4: Structure of the Feature Fusion Block in DDESeg.
  • Figure 5: Qualitative results on AVSBench-Semantic (see the qualitative analysis section). Our method achieves precise localization in multi-source cases.
  • ...and 1 more figure