SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model

Carlos Hernandez-Olivan, Marc Delcroix, Tsubasa Ochiai, Daisuke Niizumi, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

TL;DR

The paper tackles target sound extraction (TSE) for arbitrary sounds by integrating a pre-trained audio foundation model, M2D, with the SoundBeam TSE system. It leverages M2D to generate target embeddings from enrollment clues and to provide a rich mixture representation through an adaptive input enhancer, improving discrimination and extraction quality. Experimental results on semhear data show notable SNR gains, particularly for enrollment-based inference, and demonstrate feasibility for online TSE by extending M2D benefits to Waveform-based models, albeit with caveats for causal processing. The study validates the potential of foundation models to boost TSE across cue types and lays out directions for causal training and efficient online deployment.

Abstract

Target sound extraction (TSE) consists of isolating a desired sound from a mixture of arbitrary sounds using clues to identify it. A TSE system must solve two problems at once: identifying the target source and extracting the target signal from the mixture. For increased practicability, the same system should work with various types of sound. The duality of the problem and the wide variety of sounds make it challenging to train a powerful TSE system from scratch. In this paper, to tackle this problem, we explore using a pre-trained audio foundation model that can provide rich feature representations of sounds within a TSE system. We chose the masked-modeling duo (M2D) foundation model, which appears especially suited for the TSE task, as it is trained using a dual objective consisting of sound-label prediction and improved masked prediction. These objectives are related to the sound identification and signal extraction problems of TSE, respectively. We propose a new TSE system that integrates the feature representation from M2D into SoundBeam, a strong TSE system that can exploit both target sound class labels and pre-recorded enrollments (or audio queries) as clues. We show experimentally that using M2D can increase extraction performance, especially when employing enrollment clues.

Paper Structure

This paper contains 23 sections, 3 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: a) Generic TSE system with SSL model and b) M2D model and AIE module.
  • Figure 2: SNR for different target sound classes using class label (top figure) or enrollment (bottom figure) clues.
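
Figure 2 reports extraction quality as SNR. As a reminder of what that metric measures, here is a minimal sketch assuming the common definition, 10·log10 of the reference signal power over the power of the residual error. The exact variant used in the paper (for example plain SNR versus SNR improvement) is not stated in this excerpt, and the function name `snr_db` and the toy signals are illustrative only.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a reference and an estimate.

    Assumed definition: 10 * log10(sum(ref^2) / sum((ref - est)^2)).
    Both inputs are equal-length sequences of float samples.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_power == 0.0:
        return math.inf   # perfect reconstruction
    if signal_power == 0.0:
        return -math.inf  # silent reference, nothing to recover
    return 10.0 * math.log10(signal_power / noise_power)


# Toy check: a 440 Hz tone sampled at 16 kHz, and an estimate scaled by 0.9.
# The residual is 0.1 * reference, so the power ratio is 100, i.e. about 20 dB.
reference = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
estimate = [0.9 * r for r in reference]
print(round(snr_db(reference, estimate), 2))
```

A higher value means the extracted signal is closer to the clean target; per-class bars like those in Figure 2 would come from averaging such values over the test mixtures of each class.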