Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

Tathagata Bandyopadhyay

TL;DR

This paper proposes a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal and introduces two additional objectives to impose speaker embedding consistency and waveform encoder invertibility.

Abstract

Recently, attention-based transformers have become a de facto standard in many deep learning applications, including natural language processing, computer vision, and signal processing. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility, and we jointly train both the speaker encoder and the speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that using a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves on the CNN baseline by $3.12$ dB. Finally, we compare our approach with recent state-of-the-art methods and show that our model outperforms existing methods by $4.1$ dB on average without creating additional data dependency.

Paper Structure (18 sections, 8 equations, 1 figure, 2 tables)

Introduction
RELATED WORK
METHOD
Model Architecture
Speaker Encoder
Speech Separator
Design of Training Objectives
Waveform Reconstruction Quality
Speaker Embedding Consistency
Inverse Consistency
Adversarial Refinement
RESULTS
Dataset
Experimental Setup
Ablation Study
...and 3 more sections

Figures (1)

Figure 1: Spectron framework. Blocks of the same color share weights; dashed boxes denote objective functions. The Speaker Embedding Consistency Loss (SECL) and the waveform encoder-decoder Inverse Consistency Loss (ICL) are realized as MSE losses, whereas SI-SNR (le2019sdr) is used for the Waveform Reconstruction Quality Loss (WRQL) and the multi-scale discriminator "MSD" (kong2020hifi) for the discriminator loss.
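
The two loss types named in the Figure 1 caption can be illustrated with a minimal sketch. It shows SI-SNR in its commonly used zero-mean, projection-based form and a plain MSE of the kind that could serve for SECL or ICL. The function names `si_snr` and `mse`, the use of NumPy, and the `eps` stabilizer are assumptions made for illustration; the paper's actual implementation, tensor shapes, and loss weighting are not given here.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D waveforms (hypothetical helper)."""
    # Remove the mean so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled target component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    # Whatever is left after the projection is treated as noise.
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)


def mse(a, b):
    """Mean squared error, the form the caption gives for SECL and ICL."""
    return float(np.mean((a - b) ** 2))


# Toy signals standing in for a clean target and two extracted estimates.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
good_estimate = clean + 0.1 * noise
poor_estimate = clean + 1.0 * noise

# A training loss would typically minimize the negative SI-SNR.
wrql_good = -si_snr(good_estimate, clean)
wrql_poor = -si_snr(poor_estimate, clean)
```

Because SI-SNR is invariant to rescaling the estimate, the extracted waveform is not penalized for a gain mismatch. In this sketch, `wrql_good` is lower than `wrql_poor` because the first estimate contains less residual noise.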
