Leveraging Sound Source Trajectories for Universal Sound Separation

Donghang Wu, Xihong Wu, Tianshu Qu

TL;DR

The paper tackles universal sound separation for moving sources by coupling localization and separation in a mutual facilitation framework. It introduces a three-stage method: an envelope-based initial tracking stage, a mutual facilitation stage in which target extraction and tracking iteratively refine each other, and a neural beamforming stage that produces the final single-channel output. In simulations with moving sources under reverberation, the approach outperforms baselines, indicating that iterative refinement improves both tracking accuracy and separation quality, even when the number of sources is unknown. The results suggest that integrating localization cues into the separation pipeline makes performance more robust in dynamic acoustic environments.

Abstract

Existing methods that use spatial information for sound source separation either require prior knowledge of the direction of arrival (DOA) of the source or rely on estimated but imprecise localization results, which impairs separation performance, especially when the sound sources are moving. Sound source localization and separation are in fact interconnected problems: localization facilitates separation, while separation contributes to refined localization. This paper proposes a method that exploits this mutual facilitation mechanism for moving sources. The proposed method comprises three stages. The first stage is initial tracking, which tracks each sound source in the audio mixture based on an estimate of the source signal envelope. These tracking results may lack sufficient accuracy. The second stage is mutual facilitation: sound separation is conducted using the preliminary tracking results, and sound source tracking is then performed on the separated signals, which refines the tracking precision. The refined trajectories in turn improve separation performance. This mutual facilitation process can be iterated multiple times. In the third stage, a neural beamformer estimates precise single-channel separation results from the refined tracking trajectories and the multi-channel separation outputs. Simulation experiments conducted under reverberant conditions with moving sound sources demonstrate that the proposed method achieves more accurate separation based on the refined tracking results.
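The control flow of the three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: every name here (`separate_with_trajectories`, `initial_tracking`, `extract_target`, `track_source`, `beamform`, `num_iterations`) is a hypothetical placeholder, the neural models are passed in as plain callables, and the extra extraction pass after the loop is an assumption about how the refined trajectory feeds the last stage.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in types, not taken from the paper.
Trajectory = List[float]  # e.g. one DOA estimate per frame
Signal = List[float]      # stand-in for a (multi-channel) waveform


def separate_with_trajectories(
    mixture: Signal,
    initial_tracking: Callable[[Signal], List[Trajectory]],
    extract_target: Callable[[Signal, Trajectory], Signal],
    track_source: Callable[[Signal], Trajectory],
    beamform: Callable[[Signal, Trajectory], Signal],
    num_iterations: int = 2,
) -> List[Tuple[Trajectory, Signal]]:
    """Run the three-stage flow; the models are supplied as callables."""
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")

    # Stage 1: coarse trajectories, one per source, from the mixture.
    trajectories = initial_tracking(mixture)

    results = []
    for trajectory in trajectories:
        # Stage 2: alternate separation and tracking so each refines the other.
        for _ in range(num_iterations):
            separated = extract_target(mixture, trajectory)
            trajectory = track_source(separated)

        # Assumption: one more extraction pass with the refined trajectory.
        separated = extract_target(mixture, trajectory)

        # Stage 3: beamformer maps the separated signal and the refined
        # trajectory to the final single-channel estimate.
        results.append((trajectory, beamform(separated, trajectory)))
    return results
```

With `num_iterations=2`, each source goes through two extract-then-track rounds, one further extraction with the refined trajectory, and a single beamforming call. In practice the callables would wrap trained networks; here they are left abstract so that only the ordering of the stages is shown.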

Paper Structure

This paper contains 24 sections, 11 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: The overall system of the proposed method.
  • Figure 2: The schematic diagram of each stage. (a) Stage 1: the initial tracking module. (b) Stage 2: the mutual facilitation module. (c) Stage 3: the neural beamforming module.
  • Figure 3: The structure of modified SpatialNet for envelope estimation.
  • Figure 4: The architecture of DCSAnet.
  • Figure 5: The structure of modified SpatialNet for target sound extraction.
  • ...and 6 more figures