Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising

Riccardo Rota; Kiril Ratmanski; Jozef Coldenhoff; Milos Cernak

TL;DR

TVF (Time-Varying Filtering), a low-latency speech enhancement model with 1 million parameters, bridges the gap between traditional filtering and modern neural speech modeling and achieves effective adaptation to changing noise conditions.

Abstract

We present TVF (Time-Varying Filtering), a low-latency speech enhancement model with 1 million parameters. Combining the interpretability of Digital Signal Processing (DSP) with the adaptability of deep learning, TVF bridges the gap between traditional filtering and modern neural speech modeling. The model uses a lightweight neural network backbone to predict the coefficients of a differentiable 35-band IIR filter cascade in real time, allowing it to adapt dynamically to non-stationary noise. Unlike "black-box" deep learning approaches, TVF offers a fully interpretable processing chain in which spectral modifications are explicit and adjustable. We demonstrate the efficacy of this approach on a speech denoising task using the Valentini-Botinhao dataset and compare the results to a static DDSP approach and a fully deep-learning-based solution, showing that TVF achieves effective adaptation to changing noise conditions.
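
To make the idea of a time-varying IIR cascade concrete, the following is a minimal NumPy sketch of the forward pass only. It is not the authors' implementation: it assumes peaking biquads from the widely used RBJ Audio EQ Cookbook, fixed center frequencies and Q, coefficients held constant within each frame, and a transposed direct form II recursion whose state carries over when the coefficients change. The names `peaking_biquad`, `time_varying_cascade`, `gains_db`, and `hop` are hypothetical, and the `gains_db` array merely stands in for whatever the neural backbone would predict.

```python
import numpy as np

def peaking_biquad(f0, gain_db, q, fs):
    """RBJ-cookbook peaking-EQ coefficients, normalised so that a[0] == 1."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    cos_w0 = np.cos(w0)
    b = np.array([1.0 + alpha * a_lin, -2.0 * cos_w0, 1.0 - alpha * a_lin])
    a = np.array([1.0 + alpha / a_lin, -2.0 * cos_w0, 1.0 - alpha / a_lin])
    return b / a[0], a / a[0]

def time_varying_cascade(x, gains_db, centers_hz, q, fs, hop):
    """Run x through one biquad per band, updating gains every `hop` samples.

    gains_db has shape (num_frames, num_bands); it is a stand-in for the
    per-frame output of a predictor network (an assumption of this sketch).
    """
    gains_db = np.asarray(gains_db, dtype=np.float64)
    num_frames, num_bands = gains_db.shape
    y = np.asarray(x, dtype=np.float64).copy()
    for k in range(num_bands):
        z1 = z2 = 0.0  # filter state, kept across coefficient updates
        out = np.empty_like(y)
        for n in range(len(y)):
            if n % hop == 0:
                # Hold the last frame's gains if the signal outlasts them.
                frame = min(n // hop, num_frames - 1)
                b, a = peaking_biquad(centers_hz[k], gains_db[frame, k], q, fs)
            # Transposed direct form II update.
            out[n] = b[0] * y[n] + z1
            z1 = b[1] * y[n] - a[1] * out[n] + z2
            z2 = b[2] * y[n] - a[2] * out[n]
        y = out  # the output of band k feeds band k + 1
    return y

# Usage: 35 log-spaced bands (the spacing is an assumption), flat for the
# first half of the signal, then a 6 dB cut applied to every band.
fs, hop = 16000, 256
centers = np.geomspace(50.0, 7500.0, 35)
signal = np.random.default_rng(0).standard_normal(2048)
gains = np.zeros((8, 35))
gains[4:, :] = -6.0
filtered = time_varying_cascade(signal, gains, centers, q=4.0, fs=fs, hop=hop)
```

Because every band is an explicit gain at a known frequency, the per-frame response can be read off or edited directly, which is the sense in which such a chain is interpretable. A trainable version would need the same recursion expressed in an automatic differentiation framework.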

Paper Structure (18 sections, 3 equations, 2 figures, 1 table)

Introduction
Related Work
Methodology
Machine Learning Backbone
Time Varying IIR Filter Cascade
IIR Filtering Implementation
Weights Initialization
The Static PEQ Baseline
Experiments
Dataset
Training Setup
Evaluation Metrics
Results
Objective Evaluation
Perceptual Quality Assessment
...and 3 more sections

Figures (2)

Figure 1: Model architecture: Here $T$ is the total number of samples, $N$ is the number of frames, $L=1024$ is the frame length, $F=513$ is the number of frequency bins, $C=129$ is the number of features per channel after the two convolutions, $D=256$ is the hidden and output dimension of the GRU.
Figure 2: Analysis of the filtering on a track with non-stationary background noise. From top to bottom: noisy input spectrogram, adaptive frequency response of TVF, denoised output.
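
The dimensions quoted in the Figure 1 caption can be checked with a short shape calculation. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the hop size of 256 and the convolution settings (kernel 3, stride 2, padding 1) are hypothetical choices that happen to reproduce $F=513$ and $C=129$ from $L=1024$; the caption itself does not specify them. The helper names `frame_signal` and `conv_out_len` are likewise made up for this example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames of shape (N, frame_len)."""
    num_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return x[idx]

def conv_out_len(n, kernel, stride, padding):
    """Output length of a 1-D convolution along an axis of length n."""
    return (n + 2 * padding - kernel) // stride + 1

L = 1024                                   # frame length from the caption
x = np.random.default_rng(0).standard_normal(16000)   # T = 16000 samples
frames = frame_signal(x, L, hop=256)       # hop size is an assumption
spec = np.abs(np.fft.rfft(frames, axis=-1))
F = spec.shape[-1]                         # L // 2 + 1 one-sided bins

# Two stride-2 convolutions (hypothetical settings) along the frequency axis.
C = conv_out_len(conv_out_len(F, 3, 2, 1), 3, 2, 1)
print(frames.shape, spec.shape, C)
```

The one-sided FFT of a real frame of length $L$ always has $L/2+1$ bins, so $F=513$ follows directly from $L=1024$. The step from 513 to 129 is only one of several configurations that would fit the caption.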
