Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement

Daniel Haider, Felix Perfler, Vincent Lostanlen, Martin Ehler, Peter Balazs

TL;DR

This paper makes 1-D filters on raw audio easier to train and more stable through hybrid solutions that combine theory-driven and data-driven approaches, significantly improving the perceptual evaluation of speech quality (PESQ) in speech enhancement.

Abstract

Convolutional layers with 1-D filters are often used as a frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via an auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives to the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.
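
To make the notion of energy conservation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes circular convolution with stride 1, in which case the frame bounds A and B of a filterbank are the minimum and maximum over frequency of the summed squared magnitude responses. The sizes J, T, N and the helper names `frame_bounds` and `encode` are hypothetical; the variance (TJ)^{-1} follows the random filterbank described in the caption of Figure 2.

```python
import numpy as np

def frame_bounds(filters, signal_length):
    """Frame bounds (A, B) of a stride-1 filterbank under circular convolution."""
    # Zero-pad each filter to the signal length and take its DFT: shape (J, N).
    responses = np.fft.fft(filters, n=signal_length, axis=1)
    # Sum of squared magnitude responses over channels (Littlewood-Paley sum).
    lp_sum = np.sum(np.abs(responses) ** 2, axis=0)
    return lp_sum.min(), lp_sum.max()

def encode(filters, x):
    """Circularly convolve x with every filter; returns coefficients of shape (J, N)."""
    responses = np.fft.fft(filters, n=x.shape[-1], axis=1)
    return np.fft.ifft(responses * np.fft.fft(x)[None, :], axis=1)

rng = np.random.default_rng(0)
J, T, N = 16, 8, 256            # channels, filter length, signal length (assumed sizes)
sigma2 = 1.0 / (T * J)          # variance of the random filter taps
filters = rng.normal(scale=np.sqrt(sigma2), size=(J, T))
x = rng.normal(size=N)          # stand-in for an audio signal

A, B = frame_bounds(filters, N)
coeffs = encode(filters, x)
# Ratio of coefficient energy to signal energy; it always lies in [A, B].
energy_ratio = np.sum(np.abs(coeffs) ** 2) / np.sum(x ** 2)
print(f"A={A:.3f}  B={B:.3f}  energy ratio={energy_ratio:.3f}")
```

The closer B/A is to 1, the closer the encoder is to a tight frame, where the energy ratio equals A for every input signal.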

Paper Structure

This paper contains 16 sections, 1 theorem, 9 equations, 3 figures, and 1 table.

Key Result

Theorem 2.1

Let $\Psi$ be a tight filterbank with frame bound $A_{\Psi}$ and $\Phi$ a random filterbank with length-$T$ filters. The associated hybrid filterbank $\Phi_{\Psi}$ is a random tight frame with

Figures (3)

  • Figure 1: The log magnitude responses of three encoders for the same speech signal. Left to right: Auditory filterbank, random filterbank, and hybrid filterbank as channel-wise composition of the previous two. While the random responses are hard to interpret, the hybrid responses are comparable to the fixed ones with the possibility to be fine-tuned in a data-driven manner.
  • Figure 2: Selections of real and imaginary parts of filters (top) and their frequency responses (bottom) from three different filterbanks. From left to right: An auditory filterbank with center frequencies uniformly on the mel scale, a random filterbank with $\sigma^2 = (TJ)^{-1}$, and a hybrid auditory filterbank as the channel-wise composition of the previous two. Different filters are displayed with different colors.
  • Figure 3: Left: Encoder-mask-decoder architecture: Encoder $\Phi$ (convolution), mask $M$ (point-wise multiplication), decoder $\Phi^\top$ (convolution and summation). Right: Mask model architecture consisting of feed-forward layers and gated recurrent units.
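
The encoder-mask-decoder pipeline in the caption of Figure 3 can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's model: convolutions are circular with stride 1 and computed in the frequency domain, the filterbank is made tight by normalizing its frequency responses (a standard construction swapped in for the learned objective), and the mask is a placeholder function rather than the feed-forward and GRU network. All function names are hypothetical.

```python
import numpy as np

def tighten(responses):
    """Normalize frequency responses so their squared magnitudes sum to 1 at every frequency."""
    lp_sum = np.sum(np.abs(responses) ** 2, axis=0, keepdims=True)
    return responses / np.sqrt(lp_sum)

def encoder(responses, x):
    """Encoder: circular convolution of x with each filter, shape (J, N)."""
    return np.fft.ifft(responses * np.fft.fft(x)[None, :], axis=1)

def decoder(responses, coeffs):
    """Decoder (adjoint of the encoder): filter each channel with the conjugate response and sum."""
    spectrum = np.sum(np.conj(responses) * np.fft.fft(coeffs, axis=1), axis=0)
    return np.fft.ifft(spectrum)

def enhance(responses, x, mask_fn):
    """Encode, apply a point-wise mask to the coefficients, and decode."""
    coeffs = encoder(responses, x)
    mask = mask_fn(np.abs(coeffs))   # placeholder for the learned mask model
    return decoder(responses, mask * coeffs).real

rng = np.random.default_rng(0)
J, T, N = 16, 8, 256                 # assumed sizes
filters = rng.normal(size=(J, T))
responses = tighten(np.fft.fft(filters, n=N, axis=1))

x = rng.normal(size=N)               # stand-in for a noisy speech signal
# With an all-ones mask, a tight filterbank with frame bound 1 reconstructs x exactly.
x_rec = enhance(responses, x, mask_fn=np.ones_like)
print(np.max(np.abs(x_rec - x)))
```

With a tight encoder, any deviation of the output from the input comes from the mask alone, which is one way to read the motivation for encouraging tightness during training.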

Theorems & Definitions (1)

  • Theorem 2.1