SEGAN: Speech Enhancement Generative Adversarial Network

Santiago Pascual, Antonio Bonafonte, Joan Serrà

TL;DR

SEGAN introduces an end-to-end, waveform-domain speech enhancement method based on a generative adversarial network. A fully convolutional encoder–decoder generator with skip connections denoises raw speech chunks; it is trained with an adversarial loss plus a compensating $L_1$ term on a large multi-speaker, multi-noise dataset. Objective metrics show competitive PESQ with improvements in CSIG, CBAK, COVL, and SSNR, while subjective listening tests favor SEGAN over both the noisy input and Wiener baselines. The work demonstrates the viability of generative architectures for audio enhancement and paves the way for perceptually weighted or more advanced architectures in future work.
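
The combination of an adversarial loss and an $L_1$ term can be sketched as below. This is only an illustration under stated assumptions: the adversarial term is written in least-squares form (the summary above does not specify the form), and the names `l1_distance`, `generator_loss`, and `l1_weight`, as well as the default weight, are hypothetical placeholders rather than values taken from the paper.

```python
def l1_distance(enhanced, clean):
    """Mean absolute difference between two equal-length waveforms."""
    if len(enhanced) != len(clean):
        raise ValueError("waveforms must have the same length")
    return sum(abs(e - c) for e, c in zip(enhanced, clean)) / len(clean)


def generator_loss(d_score_on_enhanced, enhanced, clean, l1_weight=1.0):
    """Adversarial term plus a weighted L1 term for one speech chunk.

    Assumption: a least-squares adversarial term that pushes the
    discriminator's score for the enhanced chunk toward 1 ("real").
    l1_weight is a placeholder, not a value from the source.
    """
    adversarial = 0.5 * (d_score_on_enhanced - 1.0) ** 2
    return adversarial + l1_weight * l1_distance(enhanced, clean)


# A chunk that fools the discriminator and matches the clean signal costs nothing.
perfect = generator_loss(1.0, [0.0, 0.5], [0.0, 0.5])
# A chunk the discriminator rejects and that is far from the clean signal is penalized twice.
poor = generator_loss(0.0, [1.0, 1.0], [0.0, 0.0], l1_weight=2.0)
```

The $L_1$ term ties the generator's output to the clean reference sample by sample, so the adversarial term alone does not have to carry all of the reconstruction signal.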

Abstract

Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm the effectiveness of it. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.

Paper Structure

This paper contains 11 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: GAN training process. First, D back-props a batch of real examples. Then, D back-props a batch of fake examples that come from G, and classifies them as fake. Finally, D's parameters are frozen and G back-props to make D misclassify.
  • Figure 2: Encoder-decoder architecture for speech enhancement (G network). The arrows between encoder and decoder blocks denote skip connections.
  • Figure 3: Adversarial training for speech enhancement. Dashed lines represent gradient backprop.
  • Figure 4: CMOS box plot (the median line in the SEGAN–Wiener comparison is located at 1). Positive values mean that SEGAN is preferred.
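
The three-step alternation described in the Figure 1 caption can be sketched with a deliberately tiny model. This is a toy under loud assumptions, not the SEGAN networks: the discriminator is a scalar linear function `w * x + b`, the generator only shifts its input by `theta`, the losses are least-squares, and gradients are written by hand. All names (`d_score`, `train_step`, `params`) are hypothetical.

```python
def d_score(x, w, b):
    """Toy linear discriminator (assumption: stands in for D)."""
    return w * x + b


def train_step(params, real_batch, noise_batch, lr=0.01):
    """One round of the D-real, D-fake, G update order from Figure 1."""
    w, b, theta = params["w"], params["b"], params["theta"]
    n, m = len(real_batch), len(noise_batch)

    # Step 1: D back-props a batch of real examples (target score 1).
    gw = sum((d_score(x, w, b) - 1.0) * x for x in real_batch) / n
    gb = sum((d_score(x, w, b) - 1.0) for x in real_batch) / n
    w, b = w - lr * gw, b - lr * gb

    # Step 2: D back-props a batch of fake examples from G (target score 0).
    fake = [z + theta for z in noise_batch]
    gw = sum(d_score(g, w, b) * g for g in fake) / m
    gb = sum(d_score(g, w, b) for g in fake) / m
    w, b = w - lr * gw, b - lr * gb

    # Step 3: D's parameters are frozen; G is updated so that D scores
    # its outputs as real (target score 1). Only theta changes here.
    g_theta = sum((d_score(g, w, b) - 1.0) * w for g in fake) / m
    theta = theta - lr * g_theta

    return {"w": w, "b": b, "theta": theta}


start = {"w": 1.0, "b": 0.0, "theta": 0.0}
after = train_step(start, real_batch=[1.0], noise_batch=[0.0], lr=0.1)
```

In this starting configuration the discriminator already scores the real sample as 1 and the fake sample as 0, so steps 1 and 2 leave `w` and `b` untouched and only the generator's `theta` moves, toward the real data.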