Universal Source Separation with Weakly Labelled Data

Qiuqiang Kong, Ke Chen, Haohe Liu, Xingjian Du, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Mark D. Plumbley

TL;DR

This work tackles universal source separation (USS) by training a single model on large-scale weakly labelled data (AudioSet) to separate hundreds of sound classes. It introduces a query-based conditioning framework in which anchor segments mined from weak labels guide a FiLM-conditioned ResUNet separator, with several embedding strategies (hard, soft, latent, learnable) and hierarchical inference over the AudioSet ontology. The approach reports improvements in separation quality (SDRi, and SSNR for speech enhancement) across diverse tasks (AudioSet, FSDKaggle2018, FSD50K, MUSDB18, Slakh2100, Voicebank-Demand) without relying on clean source labels, and includes ablations on embedding type, anchor duration, data augmentation, and architecture depth. The results indicate that scalable USS is feasible, and the authors release open-source code to support further development and benchmarking.

Abstract

Universal source separation (USS) is a fundamental research task for computational auditory scene analysis, which aims to separate mono recordings into individual source tracks. Three challenges remain open in audio source separation. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources; there is a lack of research on building a unified system that can separate arbitrary sources via a single model. Second, most previous systems require clean source data to train a separator, while clean source data are scarce. Third, there is a lack of USS systems that can automatically detect and separate active sound classes at a hierarchical level. To use large-scale weakly labelled/unlabelled audio data for audio source separation, we propose a universal audio source separation framework containing: 1) an audio tagging model trained on weakly labelled data as a query net; and 2) a conditional source separation model that takes query net outputs as conditions to separate arbitrary sound sources. We investigate various query nets, source separation models, and training strategies and propose a hierarchical USS strategy to automatically detect and separate sound classes from the AudioSet ontology. By solely leveraging the weakly labelled AudioSet, our USS system is successful in a wide variety of separation tasks, including sound event separation, music source separation, and speech enhancement. The USS system achieves an average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over 527 sound classes of AudioSet; 10.57 dB on the DCASE 2018 Task 2 dataset; 8.12 dB on the MUSDB18 dataset; 7.28 dB on the Slakh2100 dataset; and a segmental signal-to-noise ratio (SSNR) of 9.00 dB on the Voicebank-Demand dataset. We release the source code at https://github.com/bytedance/uss
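The abstract reports results as SDRi. As a rough illustration of what that metric measures, here is a minimal sketch assuming the common definition: SDR is the energy ratio (in dB) between the reference source and the estimation error, and SDRi is the SDR of the separated output minus the SDR of the unprocessed mixture. The function names `sdr` and `sdr_improvement`, the `eps` stabiliser, and the toy signals are illustrative assumptions, not taken from the released code.

```python
import math


def sdr(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB between two equal-length sample sequences.

    Assumed definition: 10 * log10(||s||^2 / ||s - s_hat||^2).
    `eps` is an illustrative guard against division by zero.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def sdr_improvement(reference, estimate, mixture):
    """SDRi: how much better the estimate is than simply using the mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)


# Toy example: the mixture is the source plus a constant offset of 0.5,
# and the (hypothetical) separator reduces that offset to 0.1.
source = [1.0, -1.0, 1.0, -1.0]
mixture = [1.5, -0.5, 1.5, -0.5]
separated = [1.1, -0.9, 1.1, -0.9]

print(f"SDR of mixture:   {sdr(source, mixture):.2f} dB")
print(f"SDR of separated: {sdr(source, separated):.2f} dB")
print(f"SDRi:             {sdr_improvement(source, separated, mixture):.2f} dB")
```

A positive SDRi means the separated signal is closer to the reference source than the original mixture was; averaging this value over evaluation clips gives per-class or per-dataset figures of the kind quoted in the abstract.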

Paper Structure

This paper contains 42 sections, 13 equations, 7 figures, 10 tables, 4 algorithms.

Figures (7)

  • Figure 1: The standard architecture of a deep-learning-based audio source separation model. Top left: synthesis-based separation model. Bottom left: mask-based separation model. Right: the general type of frequency-domain separation model.
  • Figure 2: Left: Clean source data of sound class "Flute". Right: Weakly labelled data of sound class "Air horn, truck horn", which occurs only between 2.5 s and 4.0 s.
  • Figure 3: The architecture of our proposed query-based audio source separation pipeline trained from weakly labelled data, including datasets, sampling strategies, audio tagging model, and conditional audio source separation models.
  • Figure 4: Top: log mel spectrogram of a 10-second audio clip from AudioSet; Middle: predicted SED probability of “Speech”, where red block shows the selected anchor segment; Bottom: predicted audio tagging probabilities of the anchor segment.
  • Figure 5: Two audio tagging models for audio classification, sound event detection, and latent feature production. Left: Pretrained Audio Neural Networks (PANN) in CNN14 architecture. Right: Hierarchical Token-Semantic Transformer (HTS-AT) in 4-block architecture.
  • ...and 2 more figures
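Figure 4 describes picking an anchor segment from the frame-wise SED probability of a target class. The sketch below shows one plausible way to do that, assuming the anchor is the fixed-length window whose summed probability is largest. The function name `select_anchor_segment`, the frame-based units, and the toy probabilities are illustrative assumptions rather than the paper's exact procedure.

```python
def select_anchor_segment(frame_probs, anchor_frames):
    """Return (start, end) frame indices of the window with the largest summed probability.

    frame_probs:   per-frame SED probabilities for one sound class (hypothetical input).
    anchor_frames: anchor length in frames (assumed fixed).
    The end index is exclusive, so frame_probs[start:end] is the anchor segment.
    """
    if anchor_frames <= 0 or anchor_frames > len(frame_probs):
        raise ValueError("anchor_frames must be between 1 and len(frame_probs)")

    # Sum of the first window, then slide one frame at a time.
    window_sum = sum(frame_probs[:anchor_frames])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(frame_probs) - anchor_frames + 1):
        # Add the frame entering the window and drop the frame leaving it.
        window_sum += frame_probs[start + anchor_frames - 1] - frame_probs[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start, best_start + anchor_frames


# Toy clip in which the target class is active in the middle frames.
probs = [0.1, 0.1, 0.8, 0.9, 0.7, 0.1]
start, end = select_anchor_segment(probs, anchor_frames=3)
print(start, end, probs[start:end])
```

The sliding sum keeps the search linear in the number of frames. In a weakly labelled clip, the selected window is the part most likely to contain the tagged event, which is what makes it usable as a query for the separator.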