Biomimetic Frontend for Differentiable Audio Processing

Ruolan Leslie Famularo; Dmitry N. Zotkin; Shihab A. Shamma; Ramani Duraiswami

TL;DR

This work builds on a classical model of human hearing and makes it differentiable, combining traditional explainable biomimetic signal processing with deep-learning frameworks. The result is an expressive, explainable model that trains easily on modest amounts of data.

Abstract

As models in audio and speech processing become deeper and more end-to-end, they require expensive training on large datasets and are often brittle. We build on a classical model of human hearing and make it differentiable, so that we can combine traditional explainable biomimetic signal processing approaches with deep-learning frameworks. This yields an expressive and explainable model that is easily trained on modest amounts of data. We apply this model to audio processing tasks, including classification and enhancement. Results show that our differentiable model surpasses black-box approaches in computational efficiency and robustness, even with little training data. We also discuss other potential applications.

Paper Structure (11 sections, 2 equations, 3 figures, 1 table)

Introduction
The auditory processing model
Step 1. A Model of the Cochlea and Peripheral Hearing
Step 2. Cortical Features
Making the Model Differentiable
Applications of the Differentiable Frontend
Phoneme Recognition
Speech Enhancement
Explainability
Discussion and Conclusions
Acknowledgement

Figures (3)

Figure 1: Auditory processing from cochlea to cortical representations is shown from left to right. Black arrows indicate the forward model. Blue arrows indicate the direction of gradient calculation using the chain rule.
Figure 2: Results for Phoneme Recognition. The $x$-axis shows test conditions: test speech mixed with pink noise at -3, 0, and 3 dB SNR, or presented clean. The $y$-axis shows classification accuracy. Solid lines denote models initialized with random cortical parameters, and dashed lines denote models initialized with log-spaced cortical parameters. Error bars denote 95% confidence intervals.
Figure 3: Distribution of trained spectrotemporal filter parameters. Left: phoneme recognition in quiet; right: speech enhancement. Each point represents one of 40 filters. Each panel shows two models, one initialized from log-spaced values (crosses) and one from randomized values (dots). The sign of the temporal modulation encodes the direction of the spectrotemporal modulation: positive values indicate upward-tilting Gabors, and negative values indicate downward-tilting ones.
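
The sign convention in the Figure 3 caption can be illustrated with a minimal NumPy sketch of a Gabor-style spectrotemporal filter bank. This is not the authors' implementation. The function name `gabor_kernel`, the Gaussian envelope widths, the parameter ranges, the pairing of rates with scales, and the exact carrier form are all assumptions chosen for illustration. Only the count of 40 filters, the log-spaced initialization, and the rule that positive temporal modulation means an upward tilt come from the captions above.

```python
import numpy as np

def gabor_kernel(rate_hz, scale_cpo, t, f, sigma_t=0.05, sigma_f=0.5):
    """Hypothetical 2D Gabor over time t (seconds) and log-frequency f (octaves).

    rate_hz   : temporal modulation; its sign sets the tilt direction
    scale_cpo : spectral modulation in cycles per octave
    sigma_t, sigma_f : assumed Gaussian envelope widths (not from the paper)
    """
    T, F = np.meshgrid(t, f, indexing="ij")  # shape (len(t), len(f))
    envelope = np.exp(-0.5 * ((T / sigma_t) ** 2 + (F / sigma_f) ** 2))
    # Constant-phase ridges satisfy scale*F - rate*T = const, so for a
    # positive rate, frequency rises with time (an upward tilt).
    carrier = np.cos(2.0 * np.pi * (scale_cpo * F - rate_hz * T))
    return envelope * carrier

# Symmetric sampling grids (assumed sizes).
t = np.linspace(-0.1, 0.1, 41)   # seconds
f = np.linspace(-1.0, 1.0, 33)   # octaves

# Log-spaced initialization of 40 filters: 20 upward, 20 downward.
n_filters = 40
half = n_filters // 2
rates = np.geomspace(2.0, 32.0, half)          # assumed range in Hz
rates = np.concatenate([rates, -rates])        # sign encodes direction
scales = np.tile(np.geomspace(0.25, 4.0, half), 2)  # assumed range in cyc/oct

bank = np.stack([gabor_kernel(r, s, t, f) for r, s in zip(rates, scales)])
```

Because the kernel is a smooth function of `rate_hz` and `scale_cpo`, the same expression written in an automatic-differentiation framework would allow these two parameters per filter to be updated by gradient descent, which is the kind of trained parameter distribution Figure 3 reports. Flipping the sign of the rate mirrors the kernel along the time axis, which is why a single signed value is enough to encode direction.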
