Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising

Riccardo Rota, Kiril Ratmanski, Jozef Coldenhoff, Milos Cernak

TL;DR

TVF (Time-Varying Filtering), a low-latency speech enhancement model with 1 million parameters, bridges the gap between traditional filtering and modern neural speech modeling and achieves effective adaptation to changing noise conditions.

Abstract

We present TVF (Time-Varying Filtering), a low-latency speech enhancement model with 1 million parameters. Combining the interpretability of Digital Signal Processing (DSP) with the adaptability of deep learning, TVF bridges the gap between traditional filtering and modern neural speech modeling. The model utilizes a lightweight neural network backbone to predict the coefficients of a differentiable 35-band IIR filter cascade in real time, allowing it to adapt dynamically to non-stationary noise. Unlike "black-box" deep learning approaches, TVF offers a completely interpretable processing chain, where spectral modifications are explicit and adjustable. We demonstrate the efficacy of this approach on a speech denoising task using the Valentini-Botinhao dataset and compare the results to a static DDSP approach and a fully deep-learning-based solution, showing that TVF achieves effective adaptation to changing noise conditions.
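
To make the idea of a time-varying IIR filter cascade concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not state the filter type, parameterization, or update rate, so everything here is an assumption for illustration: peaking-EQ biquads with the widely used Audio EQ Cookbook formulas, a hand-written per-frame gain schedule standing in for the neural network's predictions, a single band instead of 35, and the hypothetical names `peaking_biquad`, `tv_cascade`, `hop`, and `q`.

```python
import math

def peaking_biquad(f0, gain_db, q, fs):
    """Return normalized (b0, b1, b2, a1, a2) of a peaking-EQ biquad
    (Audio EQ Cookbook form; an assumed filter type, not from the paper)."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha / a_lin
    b = (1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin)
    a = (-2.0 * math.cos(w0), 1.0 - alpha / a_lin)
    return tuple(c / a0 for c in b + a)

def tv_cascade(x, frame_gains_db, centers_hz, fs, hop, q=1.0):
    """Run x through a cascade of biquads whose gains change every `hop` samples.

    frame_gains_db[k][i] is the gain (dB) of band i during frame k. In a model
    like the one described, these values would come from a network; here they
    are supplied by hand.
    """
    # One Direct Form I state [x1, x2, y1, y2] per band, carried across frames
    # so that coefficient updates do not reset the filter memory.
    state = [[0.0, 0.0, 0.0, 0.0] for _ in centers_hz]
    coeffs = None
    y = []
    for n, sample in enumerate(x):
        if n % hop == 0:  # start of a frame: recompute the coefficients
            frame = min(n // hop, len(frame_gains_db) - 1)
            coeffs = [peaking_biquad(f, g, q, fs)
                      for f, g in zip(centers_hz, frame_gains_db[frame])]
        s = sample
        for (b0, b1, b2, a1, a2), st in zip(coeffs, state):
            out = b0 * s + b1 * st[0] + b2 * st[1] - a1 * st[2] - a2 * st[3]
            st[0], st[1], st[2], st[3] = s, st[0], out, st[2]
            s = out  # the output of one band feeds the next band
        y.append(s)
    return y

# Demo with assumed values: a 1 kHz tone, first passed unchanged (0 dB),
# then attenuated by 12 dB at its own frequency.
fs, hop, f0 = 16000, 256, 1000.0
x = [math.sin(2.0 * math.pi * f0 * n / fs) for n in range(2048)]
gains = [[0.0]] * 4 + [[-12.0]] * 4
y = tv_cascade(x, gains, [f0], fs, hop)
print(max(abs(v) for v in y[:1024]), max(abs(v) for v in y[1600:]))
```

The two printed peak amplitudes should be close to 1.0 and 0.25: the tone passes unchanged while the band gain is 0 dB and settles near $10^{-12/20}$ once the gain drops to -12 dB. Because each band's gain is an explicit number per frame, the spectral modification applied at any moment can be read off directly, which is the kind of interpretability the abstract refers to.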

Paper Structure (18 sections, 3 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Model architecture: Here $T$ is the total number of samples, $N$ is the number of frames, $L=1024$ is the frame length, $F=513$ is the number of frequency bins, $C=129$ is the number of features per channel after the two convolutions, $D=256$ is the hidden and output dimension of the GRU.
  • Figure 2: Analysis of the filtering on a track with non-stationary background noise. From top to bottom: noisy input spectrogram, adaptive frequency response of TVF, denoised output.
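
The dimensions quoted in the Figure 1 caption can be checked with a little arithmetic. The sketch below assumes that the $F=513$ bins come from a one-sided spectrum of a length-$L$ frame, and that each of the two convolutions uses kernel size 3, stride 2, and padding 1 along the frequency axis. The caption does not give these convolution settings; they are simply one hypothetical configuration that reproduces $C=129$. The helper names are also illustrative.

```python
def one_sided_bins(frame_len):
    """Number of bins in the one-sided spectrum of a real frame."""
    return frame_len // 2 + 1

def conv_out(size, kernel, stride, padding):
    """Standard output-length formula for a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

L = 1024                       # frame length from the caption
F = one_sided_bins(L)          # 513
after_first = conv_out(F, kernel=3, stride=2, padding=1)            # assumed settings
C = conv_out(after_first, kernel=3, stride=2, padding=1)            # assumed settings
print(F, after_first, C)  # prints: 513 257 129
```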