Weakly-supervised Audio Separation via Bi-modal Semantic Similarity

Tanvir Mahmud, Saeed Amizadeh, Kazuhito Koishida, Diana Marculescu

TL;DR

This work tackles language-conditioned audio separation without access to single-source audio during training by introducing a bi-modal framework that leverages pretrained audio-language embeddings (CLAP) to provide weak supervision from text prompts. The method combines an unsupervised mix-and-separate training regime with a CLAP-based weak supervision loss and a consistency constraint, then extends to a semi-supervised setting that also benefits from single-source data. Experimental results across MUSIC, VGGSound, and AudioCaps show large SDR gains over purely unsupervised baselines, with the weak supervision narrowing the gap to fully supervised performance; in semi-supervised scenarios, the framework can outperform fully supervised training with far less single-source data. The approach is modular and modality-agnostic, enabling potential cross-domain application to other conditional segmentation tasks via open-ended prompts and cross-modal similarity.

Abstract

Conditional sound separation in multi-source audio mixtures without access to single-source sound data during training is a long-standing challenge. Existing mix-and-separate based methods suffer a significant performance drop with multi-source training mixtures because they lack a supervision signal for single-source separation during training. However, in language-conditional audio separation, we do have access to corresponding text descriptions for each audio mixture in our training data, which can be seen as (rough) representations of the audio samples in the language modality. To this end, we propose a generic bi-modal separation framework that enhances existing unsupervised frameworks to separate single-source signals in a target modality (i.e., audio) using the easily separable corresponding signals in the conditioning modality (i.e., language), without access to single-source samples in the target modality during training. We empirically show that this is well within reach given a pretrained joint embedding model between the two modalities (i.e., CLAP). Furthermore, we incorporate our framework into two fundamental scenarios to enhance separation performance. First, we show that our methodology significantly improves the performance of purely unsupervised baselines by reducing the distribution shift between training and test samples. In particular, our framework achieves a 71% boost in Signal-to-Distortion Ratio (SDR) over the baseline, reaching 97.5% of the supervised learning performance. Second, we show that we can further improve the performance of supervised learning itself by 17% by augmenting it with our weakly-supervised framework, which yields a powerful semi-supervised framework for audio separation.
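
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of a basic SDR computation, assuming the simple energy-ratio definition $10\log_{10}(\|s\|^2 / \|s - \hat{s}\|^2)$. The paper may use a different variant (for example a BSS Eval style SDR), so the function names `sdr_db` and `relative_gain_percent` and the toy signal are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Basic SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: plain energy-ratio SDR, not necessarily the variant
    used in the paper's evaluation.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero for a perfect estimate.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))


def relative_gain_percent(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / abs(baseline)


# Toy example: a 440 Hz tone and an estimate whose amplitude is 10% too high.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
source = np.sin(2.0 * np.pi * 440.0 * t)
estimate = 1.1 * source

# The error is 0.1 * source, so the power ratio is 100, i.e. about 20 dB.
toy_sdr = sdr_db(source, estimate)

# Made-up scores, used only to show the percentage arithmetic.
toy_gain = relative_gain_percent(4.0, 6.0)
```

If the percentages in the abstract are read as relative changes, a helper like `relative_gain_percent` is how they would be computed from two SDR scores; the scores passed to it here are invented for illustration and are not results from the paper.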

Paper Structure (43 sections, 13 equations, 11 figures, 16 tables, 1 algorithm)

Figures (11)

  • Figure 1: (Left) The proposed conditional audio separation framework. (Right) The comparison of our framework and the mix-and-separate baseline in unsupervised and semi-supervised settings.
  • Figure 2: Unsupervised mix-and-separate training with language conditioning (for $N=2, K=2$)
  • Figure 3: Our proposed weakly-supervised audio-language training framework: bi-modal contrastive loss ($\mathcal{L}_{CNT}$) combined with consistency reconstruction loss ($\mathcal{L}_{CRL}$).
  • Figure 4: Inference pipeline for the proposed language conditional sound separation framework.
  • Figure 5: Proposed conditional U-Net architecture. We incorporate three building blocks: residual (ResBlock), self-attention (SA), and cross-attention (CA) blocks. The model is divided into two modules: the head and the modulator. The head network operates on fine-grained features and generates a latent embedding. The modulator network modulates the latent features based on cross-attention conditioning.
  • ...and 6 more figures
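
The two loss terms named in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration, not the paper's formulation: `contrastive_loss` is a generic InfoNCE-style objective standing in for $\mathcal{L}_{CNT}$, `consistency_reconstruction_loss` assumes that $\mathcal{L}_{CRL}$ asks the separated estimates to sum back to the input mixture, and random vectors stand in for CLAP audio and text embeddings. All function names, the temperature value, and the L1 distance are assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def contrastive_loss(audio_emb, text_emb, temperature=0.1):
    """InfoNCE-style loss: row i of audio_emb should match row i of text_emb.

    Assumption: a generic stand-in for the bi-modal contrastive loss,
    not the exact form used in the paper.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature                      # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def consistency_reconstruction_loss(separated, mixture):
    """Mean L1 distance between the sum of separated estimates and the mixture.

    Assumption: the consistency term is modeled as a remix constraint.
    """
    return np.mean(np.abs(separated.sum(axis=0) - mixture))


rng = np.random.default_rng(0)

# Placeholder embeddings: 3 text prompts and 3 separated-audio embeddings.
text_emb = rng.normal(size=(3, 16))
aligned_audio = text_emb + 0.01 * rng.normal(size=(3, 16))
shuffled_audio = aligned_audio[[1, 2, 0]]  # each audio paired with a wrong prompt

loss_aligned = contrastive_loss(aligned_audio, text_emb)
loss_shuffled = contrastive_loss(shuffled_audio, text_emb)

# Placeholder signals: two sources and their mixture.
sources = rng.normal(size=(2, 1000))
mixture = sources.sum(axis=0)
crl_exact = consistency_reconstruction_loss(sources, mixture)
crl_off = consistency_reconstruction_loss(0.5 * sources, mixture)
```

In this sketch the contrastive term is low when each separated signal's embedding sits next to its own prompt and high when the pairing is wrong, while the consistency term is zero only when the estimates add up to the mixture. How the paper weights and combines the two terms is not stated in this excerpt.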