TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Yiwen Wang, Xihong Wu

TL;DR

This work proposes TSE-PI, a target sound extraction (TSE) model that is provided with pitch information and employs a learnable Gammatone filterbank in place of the convolutional encoder, with the aim of improving performance under reverberant environments.

Abstract

Target sound extraction (TSE) separates a target sound from a mixture signal based on provided clues. However, the performance of existing models degrades significantly under reverberant conditions. Inspired by auditory scene analysis (ASA), this work proposes TSE-PI, a TSE model that is provided with pitch information. Conditional pitch extraction is achieved with Feature-wise Linear Modulation (FiLM) layers conditioned on the sound-class label. A modified Waveformer model, which is combined with the pitch information and employs a learnable Gammatone filterbank in place of the convolutional encoder, is used for target sound extraction. The pitch information is included to improve the model's performance. Experimental results on the FSD50K dataset show a 2.4 dB improvement in target sound extraction under reverberant environments when pitch information and the Gammatone filterbank are incorporated.
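
The conditioning idea in the abstract can be illustrated with a minimal NumPy sketch of FiLM: a per-channel scale and shift are derived from the sound-class label and applied to a feature map. This is not the authors' implementation. The function name `film`, the weight matrices `w_gamma` and `w_beta`, the one-hot label encoding, and all tensor shapes are assumptions made for illustration only.

```python
import numpy as np


def film(features, label_onehot, w_gamma, w_beta):
    """Apply feature-wise linear modulation to a feature map.

    features:     (batch, channels, time) output of a convolutional layer
    label_onehot: (batch, n_classes) one-hot sound-class label (assumed encoding)
    w_gamma:      (n_classes, channels) hypothetical projection for the scale
    w_beta:       (n_classes, channels) hypothetical projection for the shift
    """
    gamma = label_onehot @ w_gamma  # (batch, channels) per-channel scale
    beta = label_onehot @ w_beta    # (batch, channels) per-channel shift
    # Broadcast the scale and shift over the time axis.
    return gamma[:, :, None] * features + beta[:, :, None]


# Toy example: 4 sound classes, 8 channels, 16 time frames (all arbitrary).
rng = np.random.default_rng(0)
n_classes, channels, frames = 4, 8, 16
w_gamma = rng.normal(size=(n_classes, channels))
w_beta = rng.normal(size=(n_classes, channels))
features = rng.normal(size=(2, channels, frames))
labels = np.eye(n_classes)[[1, 3]]  # two items with class indices 1 and 3

modulated = film(features, labels, w_gamma, w_beta)
```

In a trained model the two projections would be learned jointly with the network, so the same convolutional layers can behave differently for each sound class.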

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the proposed two-stage target sound extraction with pitch information (TSE-PI).
  • Figure 2: Conditional pitch extraction model with FiLM. (To display the FiLM module clearly, only two FiLMs are drawn in the figure. In the actual model, each convolutional layer is modulated using FiLM.)
  • Figure 3: Target sound extraction with pitch information. (The part marked in red is the modification to Waveformer.)
  • Figure 4: Comparison results for SI-SDRi (dB) with different conditions under reverberant conditions.
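
Figure 4 reports results as SI-SDRi in dB. For readers unfamiliar with the metric, the sketch below follows the commonly used definition of scale-invariant signal-to-distortion ratio (SI-SDR) and its improvement over the unprocessed mixture. The paper's own evaluation code is not shown here, so the function names, the `eps` stabiliser, and the toy signals are assumptions.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the optimally scaled target.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the input mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Toy signals: a sine "target" plus random interference (illustrative only).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
target = np.sin(2 * np.pi * 220.0 * t)
interference = rng.normal(size=t.shape)
mixture = target + interference
estimate = target + 0.1 * interference  # pretend the model removed most interference

improvement_db = si_sdr_improvement(estimate, mixture, target)
```

Because the estimate is projected onto the target before the ratio is computed, rescaling the estimate by a constant gain leaves the score essentially unchanged, which is what makes the metric scale-invariant.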