Unleashing the Power of Natural Audio Featuring Multiple Sound Sources

Xize Cheng, Slytherin Wang, Zehan Wang, Rongjie Huang, Tao Jin, Zhou Zhao

TL;DR

ClearSep is a framework that uses a data engine to decompose complex, naturally mixed audio into multiple independent tracks, enabling effective sound separation in real-world scenarios. It also introduces a series of training strategies tailored to these separated tracks to make the best use of them.

Abstract

Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, which is critical for artificial auditory perception. However, current methods heavily rely on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep, an innovative framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, thereby allowing effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics to quantitatively assess separation quality and use these metrics as thresholds to iteratively apply the data engine alongside model training, progressively optimizing separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks to make the best use of them. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios. For more examples and detailed results, please visit our demo page at https://clearsep.github.io.

Paper Structure

This paper contains 30 sections, 15 equations, 2 figures, 11 tables.

Figures (2)

  • Figure 1: Scale comparison between individual clean tracks and total tracks across seven different audio categories in AudioSet. The number of tracks in an audio clip is determined based on the number of audio categories present in AudioSet (Gemmeke et al., 2017). Clean Tracks refer to single-source audio containing only a single audio event, while Total Tracks represent the number of independent tracks corresponding to individual audio events extracted from both single-source and mixed-source audio samples.
  • Figure 2: Illustration of ClearSep and Data Engine Pipeline. ClearSep alternates between data engine and model training to progressively enhance sound separation performance and robustness. During the data engine phase, the model employs mutually exclusive class labels as queries to guide separation, ensuring that the separated tracks are independent. A quality filtering process then evaluates the separation results, and only tracks that meet the predefined criteria are incorporated into the single-source audio dataset. In the model training phase, the model is trained with both single-source audio and mixed-source audio, allowing it to achieve more accurate separation.
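
The alternation described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the page does not define the two remix-based metrics, so `remix_snr_db` below is a hypothetical stand-in that only checks how well the separated tracks sum back to the original mixture. The names `separate`, `train`, `data_engine_step`, `alternate`, and the threshold value are likewise illustrative.

```python
import numpy as np


def remix_snr_db(mixture, tracks, eps=1e-8):
    """Hypothetical remix check (an assumption, not the paper's metric):
    SNR in dB between the mixture and the sum of the separated tracks."""
    remix = np.sum(tracks, axis=0)
    residual = mixture - remix
    signal_energy = np.dot(mixture, mixture)
    residual_energy = np.dot(residual, residual)
    return 10.0 * np.log10((signal_energy + eps) / (residual_energy + eps))


def data_engine_step(separate, mixed_clips, threshold_db):
    """Query the separator once per class label of each clip and keep the
    resulting tracks only if the remix check passes the threshold."""
    accepted = []
    for mixture, labels in mixed_clips:
        tracks = [separate(mixture, label) for label in labels]
        if remix_snr_db(mixture, tracks) >= threshold_db:
            accepted.extend(zip(labels, tracks))
    return accepted


def alternate(separate, train, seed_single_source, mixed_clips, threshold_db, rounds):
    """Alternate between the data engine and model training.
    `train` is a placeholder callable that returns an updated separator."""
    pool = list(seed_single_source)
    for _ in range(rounds):
        # Data engine phase: harvest tracks that pass the quality filter.
        pool = list(seed_single_source) + data_engine_step(
            separate, mixed_clips, threshold_db
        )
        # Training phase: use both single-source tracks and mixed clips.
        separate = train(pool, mixed_clips)
    return separate, pool
```

Here each clip is a pair of a 1-D NumPy array and its list of class labels, and `separate(mixture, label)` returns the track for one label. The stand-in check only measures reconstruction; it does not by itself verify that each accepted track contains a single event, which the caption attributes to querying with mutually exclusive class labels.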