FINALLY: fast and universal speech enhancement with studio-like quality

Nicholas Babaev, Kirill Tamogashev, Azat Saginbaev, Ivan Shchekotov, Hanbin Bae, Hosang Sung, WonJun Lee, Hoon-Young Cho, Pavel Andreev

TL;DR

This paper integrates a WavLM-based perceptual loss into an MS-STFT adversarial training pipeline, yielding an effective and stable training procedure for a speech enhancement model that builds on the HiFi++ architecture and adds a WavLM encoder.

Abstract

In this paper, we address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion, such as background noise, reverberation, and microphone artifacts. We revisit the use of Generative Adversarial Networks (GANs) for speech enhancement and theoretically show that GANs are naturally inclined to seek the point of maximum density within the conditional clean speech distribution, which, as we argue, is essential for the speech enhancement task. We study various feature extractors for perceptual loss to facilitate the stability of adversarial training, developing a methodology for probing the structure of the feature space. This leads us to integrate a WavLM-based perceptual loss into an MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model. The resulting speech enhancement model, which we refer to as FINALLY, builds upon the HiFi++ architecture, augmented with a WavLM encoder and a novel training pipeline. Empirical results on various datasets confirm our model's ability to produce clear, high-quality speech at 48 kHz, achieving state-of-the-art performance in the field of speech enhancement. Demo page: https://samsunglabs.github.io/FINALLY-page

Paper Structure

This paper contains 43 sections, 2 theorems, 13 equations, 10 figures, and 12 tables.

Key Result

Proposition 1

Let $p_{\text{clean}}(y|x) > 0$ be a finite and Lipschitz continuous density function with a unique global maximum and $p^{\xi}_g(y|x) = \xi^n / 2^n \cdot \mathbf{1}_{y - g_\theta(x) \in [ - 1/\xi, 1/\xi]^n}$, then
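
As a sanity check on the premise (this worked step is an addition, not part of the paper's statement, and it assumes $y \in \mathbb{R}^n$ and $\xi > 0$): $p^{\xi}_g(y|x)$ is the uniform density on a cube of side $2/\xi$ centred at $g_\theta(x)$, so it integrates to one.

```latex
\[
\int_{\mathbb{R}^n} p^{\xi}_g(y|x)\,dy
  = \frac{\xi^n}{2^n}\cdot
    \operatorname{vol}\!\left(g_\theta(x) + \left[-\tfrac{1}{\xi},\, \tfrac{1}{\xi}\right]^n\right)
  = \frac{\xi^n}{2^n}\cdot\left(\frac{2}{\xi}\right)^{n}
  = 1 .
\]
```

As $\xi$ grows, the cube shrinks, so the density concentrates around the single point $g_\theta(x)$.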

Figures (10)

  • Figure 1: Illustration of heuristic rules for feature space structure. The Clustering rule (left) states that representations of the same speech sound should form clusters. The SNR rule (right) states that noise samples should deviate from the centre of the cluster as the amount of noise increases. Illustrations created using real samples are presented in the figures labelled fig:snr_real and fig:clust_real.
  • Figure 2: FINALLY model architecture.
  • Figure 3: Ground truth waveform and waveform resynthesized by the HiFi-GAN vocoder. Although the waveforms differ significantly, they correspond to the same sound, which creates ambiguity in generation.
  • Figure 4: Regression in an entangled feature space might cause the expectation to lie outside the regions of high density, whereas regression in a disentangled space helps the expectation lie within regions of high probability density.
  • Figure 5: Comparison of training schemes on spectrograms. From top to bottom: the first spectrogram is obtained from the model trained only on WavLM (Chen et al., 2022) features; the second is produced by the model trained on both WavLM features and STFT L1 loss; the third is obtained by training with both of these losses plus an adversarial loss; the last is computed from the ground truth audio.
  • ...and 5 more figures
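
The SNR rule from Figure 1 can be turned into a simple check. The sketch below is illustrative only: the function names, the toy two-dimensional features, and the use of Euclidean distance with a strict monotonicity test are assumptions, not the paper's probing methodology.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def mean_distance(vectors, centre):
    """Average Euclidean distance from each vector to the given centre."""
    return sum(math.dist(v, centre) for v in vectors) / len(vectors)


def satisfies_snr_rule(clean_features, noisy_by_snr):
    """Check that noisier samples lie farther from the clean cluster centre.

    noisy_by_snr maps an SNR in dB to features of the same sound at that SNR.
    Returns True if the mean distance grows strictly as the SNR decreases.
    """
    centre = centroid(clean_features)
    # Order from highest SNR (least noise) to lowest SNR (most noise).
    snrs = sorted(noisy_by_snr, reverse=True)
    distances = [mean_distance(noisy_by_snr[s], centre) for s in snrs]
    return all(a < b for a, b in zip(distances, distances[1:]))


# Toy 2-D "features" (hypothetical; real ones would come from a feature extractor).
clean = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]
noisy = {
    20: [[1.2, 1.0], [1.0, 1.2]],
    10: [[1.5, 1.0], [1.0, 1.5]],
    0: [[2.0, 1.0], [1.0, 2.0]],
}
print(satisfies_snr_rule(clean, noisy))
```

A feature extractor whose representations fail this kind of check on real recordings would, under the heuristic, be a weaker candidate for a perceptual loss.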

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof