Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement

Daniel Haider; Felix Perfler; Vincent Lostanlen; Martin Ehler; Peter Balazs

TL;DR

This paper stabilizes the training of learnable 1-D filters on raw audio with hybrid solutions, i.e., combinations of theory-driven and data-driven approaches, and significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.

Abstract

Convolutional layers with 1-D filters are often used as a frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via an auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives to the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.
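To make the frame-theoretic notions of energy conservation and perfect reconstruction concrete, here is a minimal NumPy sketch. It is not the paper's objective. It assumes stride-1 circular convolution, for which the frame bounds $A$ and $B$ of a filterbank are the minimum and maximum over frequency of the summed squared magnitude responses, and it uses the standard condition number $\kappa = B/A$ (a tight filterbank has $\kappa = 1$). The helper names, the sizes `J`, `T`, `N`, and the use of real-valued Gaussian filters are illustrative assumptions.

```python
import numpy as np

def littlewood_paley_sum(filters, n):
    """Sum over channels of |DFT(filter_j)|^2, with filters zero-padded to length n."""
    responses = np.fft.fft(filters, n=n, axis=1)
    return np.sum(np.abs(responses) ** 2, axis=0)

def frame_bounds(filters, n):
    """Frame bounds (A, B), assuming stride-1 circular convolution on length-n signals."""
    lp = littlewood_paley_sum(filters, n)
    return lp.min(), lp.max()

def condition_number(filters, n):
    """kappa = B / A; equals 1 exactly when the filterbank is tight."""
    a, b = frame_bounds(filters, n)
    return b / a

rng = np.random.default_rng(0)
J, T, N = 40, 16, 512  # channels, filter length, signal length (assumed values)

# Random filterbank with entry variance 1 / (T * J); real-valued here for simplicity.
phi = rng.normal(scale=np.sqrt(1.0 / (T * J)), size=(J, T))
A, B = frame_bounds(phi, N)
kappa = condition_number(phi, N)

# Energy check: A * ||x||^2 <= sum_j ||x * phi_j||^2 <= B * ||x||^2.
x = rng.normal(size=N)
coeffs = np.fft.ifft(np.fft.fft(x)[None, :] * np.fft.fft(phi, n=N, axis=1), axis=1)
energy_ratio = np.sum(np.abs(coeffs) ** 2) / np.sum(x ** 2)

print(f"A={A:.3f}  B={B:.3f}  kappa={kappa:.2f}  energy ratio={energy_ratio:.3f}")
```

A differentiable version of `condition_number` could serve as a penalty term during training; whether the paper's $\kappa$-penalization is defined this way cannot be confirmed from this summary.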

Paper Structure (16 sections, 1 theorem, 9 equations, 3 figures, 1 table)

Introduction
Learning Tight Hybrid Filterbanks with Inductive Auditory Bias
Tight Filterbank Frames
Encoder Design: Hybrid Auditory Filterbanks
Stability of Hybrid Filterbanks and $\kappa$-penalization
Model Implementation for Speech Enhancement and Training
Encoder/Decoder Design
Mask Model Architecture
Training
Dataset
Results and Discussion
General
Speech Enhancement
Limitations and Outlook
Conclusion
...and 1 more sections

Key Result

Theorem 2.1

Let $\Psi$ be a tight filterbank with frame bound $A_{\Psi}$ and $\Phi$ a random filterbank with length-$T$ filters. The associated hybrid filterbank $\Phi_{\Psi}$ is a random tight frame with

Figures (3)

Figure 1: The log magnitude responses of three encoders for the same speech signal. Left to right: Auditory filterbank, random filterbank, and hybrid filterbank as channel-wise composition of the previous two. While the random responses are hard to interpret, the hybrid responses are comparable to the fixed ones with the possibility to be fine-tuned in a data-driven manner.
Figure 2: Selections of real and imaginary parts of filters (top) and their frequency responses (bottom) from three different filterbanks. From left to right: An auditory filterbank with center frequencies uniformly on the mel scale, a random filterbank with $\sigma^2 = (TJ)^{-1}$, and a hybrid auditory filterbank as the channel-wise composition of the previous two. Different filters are displayed with different colors.
Figure 3: Left: Encoder-mask-decoder architecture: Encoder $\Phi$ (convolution), mask $M$ (point-wise multiplication), decoder $\Phi^\top$ (convolution and summation). Right: Mask model architecture consisting of feed-forward layers and gated recurrent units.
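The channel-wise composition from Figure 2 and the encoder-mask-decoder pipeline from Figure 3 can be sketched in a few lines of NumPy. This is a rough illustration under several assumptions, not the authors' implementation. Convolutions are circular with stride 1. The "auditory" responses are hypothetical Gaussian bumps on a linear frequency axis, standing in for a mel-scaled auditory filterbank. The hybrid filterbank is made tight by direct normalization in the frequency domain, whereas the paper describes a learning objective. The mask is a constant rather than the output of the recurrent mask model.

```python
import numpy as np

def hybrid_responses(aud_resp, rand_filters):
    """Channel-wise composition: convolving filter j of one bank with filter j of
    the other corresponds to multiplying their frequency responses."""
    n = aud_resp.shape[1]
    return aud_resp * np.fft.fft(rand_filters, n=n, axis=1)

def encode(x, resp):
    """Encoder: one circular convolution per channel, giving (J, N) coefficients."""
    return np.fft.ifft(np.fft.fft(x)[None, :] * resp, axis=1)

def decode(coeffs, resp):
    """Decoder: adjoint of the encoder (matched convolution, then sum over channels)."""
    return np.fft.ifft(np.sum(np.conj(resp) * np.fft.fft(coeffs, axis=1), axis=0))

def enhance(x, resp, mask):
    """Encoder -> point-wise mask -> decoder."""
    return decode(mask * encode(x, resp), resp)

rng = np.random.default_rng(0)
J, T, N = 32, 16, 1024  # channels, random filter length, signal length (assumed)

# Hypothetical stand-in for an auditory filterbank: Gaussian bumps in frequency.
freqs = np.abs(np.fft.fftfreq(N))        # symmetric in frequency, so filters are real
centers = np.linspace(0.0, 0.5, J)       # linear spacing, NOT the mel scale
aud_resp = np.exp(-0.5 * ((freqs[None, :] - centers[:, None]) / (0.5 / J)) ** 2)

# Random filters with entry variance 1 / (T * J), composed channel-wise.
rand_filters = rng.normal(scale=np.sqrt(1.0 / (T * J)), size=(J, T))
hyb = hybrid_responses(aud_resp, rand_filters)

# Illustrative tightening: divide by the square root of the summed squared responses.
lp = np.sum(np.abs(hyb) ** 2, axis=0)
tight = hyb / np.sqrt(lp)[None, :]

x = rng.normal(size=N)
identity_mask = np.ones((J, N))
x_rec = enhance(x, tight, identity_mask).real
x_half = enhance(x, tight, 0.5 * identity_mask).real

print("max reconstruction error:", np.max(np.abs(x_rec - x)))
```

With a tight filterbank and an all-ones mask, the decoder inverts the encoder up to floating-point error, which is the perfect-reconstruction property the abstract refers to. A learned mask would instead attenuate coefficients dominated by noise.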

Theorems & Definitions (1)

Theorem 2.1
