SoundCompass: Navigating Target Sound Extraction With Effective Directional Clue Integration In Complex Acoustic Scenes

Dayun Choi, Jung-Woo Choi

TL;DR

The paper addresses target sound extraction (TSE) in complex acoustic scenes using direction-of-arrival (DoA) clues. It introduces SoundCompass, which combines a Spectral Pairwise INteraction (SPIN) module that captures cross-channel spatial correlations, a spherical harmonics (SH) DoA embedding that encodes direction continuously, and a chain-of-inference (CoI) refinement that recursively fuses the DoA clue with the sound event activation estimated in the previous inference stage. The DoA clue is fused with the spatial features across overlapping subbands through feature-wise linear modulation (FiLM), and the iterative refinement relies on a sound-event decoder. Results on ASA2-derived 4-channel data show higher SNR improvement and better spatial fidelity at reasonable complexity, suggesting potential for moving sources and real-world spatial scenes in hearing aids, AR/VR, and teleconferencing.
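To make the idea of cross-channel spatial correlations concrete, the following is a minimal sketch rather than the SPIN module itself, whose internal operations are not given in this summary. It assumes that a pairwise interaction can be illustrated by the product $X_i X_j^{*}$ of two channels at a single time-frequency bin, whose phase is the inter-channel phase difference. The function name `pairwise_interactions` and the toy 4-channel values are hypothetical.

```python
import cmath
from itertools import combinations


def pairwise_interactions(bin_values):
    """Return X_i * conj(X_j) for every channel pair (i < j) at one T-F bin.

    Assumption: this plain conjugate product only illustrates a cross-channel
    correlation; it is not claimed to be the paper's SPIN computation.
    """
    return {
        (i, j): bin_values[i] * bin_values[j].conjugate()
        for i, j in combinations(range(len(bin_values)), 2)
    }


# Toy 4-channel bin: unit magnitude, each channel lags the previous by 0.3 rad.
delta = 0.3
x = [cmath.exp(-1j * delta * m) for m in range(4)]

pairs = pairwise_interactions(x)
ipd_01 = cmath.phase(pairs[(0, 1)])  # phase difference between channels 0 and 1
ipd_03 = cmath.phase(pairs[(0, 3)])  # phase difference between channels 0 and 3
```

With 4 channels there are 6 unordered pairs. In this toy case the phase of each product grows with the channel spacing, which is the kind of direction-dependent information a hand-crafted feature might only partly retain.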

Abstract

Recent advances in target sound extraction (TSE) utilize directional clues derived from direction of arrival (DoA), which represent an inherent spatial property of sound available in any acoustic scene. However, previous DoA-based methods rely on hand-crafted features or discrete encodings, which lose fine-grained spatial information and limit adaptability. We propose SoundCompass, an effective directional clue integration framework centered on a Spectral Pairwise INteraction (SPIN) module that captures cross-channel spatial correlations in the complex spectrogram domain to preserve full spatial information in multichannel signals. The input feature, expressed in terms of spatial correlations, is fused with a DoA clue represented as a spherical harmonics (SH) encoding. The fusion is carried out across overlapping frequency subbands, inheriting the benefits reported for previous band-split architectures. We also incorporate an iterative refinement strategy, chain-of-inference (CoI), into the TSE framework, which recursively fuses DoA with the sound event activation estimated from the previous inference stage. Experiments demonstrate that SoundCompass, combining SPIN, SH embedding, and CoI, robustly extracts target sources across diverse signal classes and spatial configurations.
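The contrast between a discrete encoding and a continuous SH encoding can be illustrated with a short sketch. It assumes a first-order real SH basis in ACN channel order with SN3D normalization, a convention common in ambisonics. The SH order and normalization actually used by SoundCompass are not stated in this summary, and the function name `sh_encode_first_order` is hypothetical.

```python
import math


def sh_encode_first_order(azimuth_deg, elevation_deg):
    """Encode a DoA as first-order real spherical harmonics.

    Assumption: ACN order (W, Y, Z, X) with SN3D normalization; the paper's
    own SH order and normalization may differ.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = math.cos(el) * math.cos(az)  # front-back component
    y = math.cos(el) * math.sin(az)  # left-right component
    z = math.sin(el)                 # up-down component
    return [1.0, y, z, x]


def euclidean(a, b):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


front = sh_encode_first_order(0.0, 0.0)
left = sh_encode_first_order(90.0, 0.0)

# Azimuths 359.9 and 0.1 degrees are 0.2 degrees apart on the circle,
# and their embeddings are correspondingly close.
wrap_gap = euclidean(sh_encode_first_order(359.9, 0.0),
                     sh_encode_first_order(0.1, 0.0))
```

Unlike a one-hot encoding over angular bins, nearby directions map to nearby vectors, including across the 0/360 degree wrap-around, so no quantization of the direction is needed.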

Paper Structure

This paper contains 10 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: (a) Overall architecture of SoundCompass for DoA-based target sound extraction and (b) details of a fusion module, which includes a Spectral Pairwise INteraction (SPIN) module and integrates the directional clue by feature-wise linear modulation (FiLM) for $K$ subbands.
  • Figure 2: Details of iterative refinement.
  • Figure 3: The t-SNE trajectories of the FiLM scale ($\gamma$) parameters across three subbands, with respect to azimuth (top, for 5 fixed elevations) and elevation (bottom, for 5 fixed azimuths).
  • Figure 4: An example of SI-SNRi contour maps within $\pm15^{\circ}$ from each target direction marked as "X" in a cuboid room of size [width, length, height] = [5.57, 5.20, 3.79] m with an RT60 of 0.32 s.
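The FiLM-based fusion for $K$ subbands mentioned in the captions of Figures 1 and 3 can be sketched as follows. This is a minimal illustration, assuming that each subband has its own linear maps that turn the DoA embedding into a scale $\gamma$ and a shift $\beta$, which are applied as $\gamma \odot x + \beta$. The parameter layout (`W_gamma`, `b_gamma`, `W_beta`, `b_beta`), the helper names, and the toy numbers are hypothetical and not taken from the paper.

```python
def linear(weights, bias, vec):
    """Apply y = W v + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * x + beta, per feature."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


def film_per_subband(subband_features, doa_embedding, params):
    """Modulate each subband with its own gamma/beta derived from the DoA.

    Assumption: one pair of linear maps per subband; the real module may use
    deeper networks or share weights across subbands.
    """
    out = []
    for feats, p in zip(subband_features, params):
        gamma = linear(p["W_gamma"], p["b_gamma"], doa_embedding)
        beta = linear(p["W_beta"], p["b_beta"], doa_embedding)
        out.append(film(feats, gamma, beta))
    return out


# Toy setup: K = 2 subbands, 2 features each, 4-dimensional DoA embedding.
doa = [1.0, 0.0, 0.0, 1.0]
zeros = [[0.0] * 4, [0.0] * 4]
params = [
    {"W_gamma": [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
     "b_gamma": [0.0, 0.0], "W_beta": zeros, "b_beta": [0.5, -0.5]},
    {"W_gamma": [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
     "b_gamma": [2.0, 2.0], "W_beta": zeros, "b_beta": [0.0, 0.0]},
]
features = [[2.0, 3.0], [1.0, -1.0]]

modulated = film_per_subband(features, doa, params)
```

Because every subband has its own $\gamma$ and $\beta$, the same direction can modulate low and high frequencies differently, which is consistent with Figure 3 plotting $\gamma$ trajectories separately for each subband.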