Biomimetic Frontend for Differentiable Audio Processing

Ruolan Leslie Famularo, Dmitry N. Zotkin, Shihab A. Shamma, Ramani Duraiswami

TL;DR

This work builds on a classical model of human hearing and makes it differentiable, combining traditional explainable biomimetic signal processing with deep-learning frameworks to obtain an expressive, explainable model that trains easily on modest amounts of data.

Abstract

As models in audio and speech processing become deeper and more end-to-end, they require expensive training on large datasets and are often brittle. We build on a classical model of human hearing and make it differentiable, so that we can combine traditional explainable biomimetic signal processing approaches with deep-learning frameworks. This allows us to arrive at an expressive and explainable model that is easily trained on modest amounts of data. We apply this model to audio processing tasks, including classification and enhancement. Results show that our differentiable model surpasses black-box approaches in computational efficiency and robustness, even with little training data. We also discuss other potential applications.

Paper Structure (11 sections, 2 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Auditory processing from cochlea to cortical representations is shown from left to right. Black arrows indicate the forward model. Blue arrows indicate the direction of gradient calculation using the chain rule.
  • Figure 2: Results for Phoneme Recognition. The $x$-axis shows test conditions where test speech was mixed with pink noise at -3, 0, and 3 dB SNR or clean. The $y$-axis shows classification accuracy. Solid lines denote models initialized with random cortical parameters, and dashed lines denote models initialized with log-spaced cortical parameters. Error bars denote 95% CI.
  • Figure 3: Distribution of trained spectrotemporal filter parameters. Left: phoneme recognition in quiet; right: speech enhancement. Each point represents one of 40 filters. Two models are shown in each panel, initialized from log-spaced values (cross) and randomized values (dot), respectively. The sign of the temporal modulation encodes the direction of the spectrotemporal modulation: positive values indicate upward-tilting Gabors and negative values indicate downward-tilting ones.
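
The sign convention described in the Figure 3 caption can be illustrated with a minimal NumPy sketch of a single spectrotemporal Gabor filter. This is not the authors' implementation: the function name `gabor_filter`, the grid sizes, the Gaussian envelope widths, and the exact phase convention are all assumptions chosen so that a positive temporal modulation gives an upward-tilting pattern, as in the caption. The filter is built only from smooth operations (`exp`, `cos`), so the same expression could in principle be written in an automatic-differentiation framework and its two parameters trained by gradient descent.

```python
import numpy as np

def gabor_filter(rate_hz, scale_cpo, n_freq=49, n_time=65,
                 chans_per_oct=12.0, frame_rate_hz=100.0,
                 sigma_oct=1.0, sigma_s=0.15):
    """Hypothetical 2-D Gabor over (frequency, time).

    rate_hz   : temporal modulation in Hz (its sign sets the tilt direction)
    scale_cpo : spectral modulation in cycles per octave
    All other arguments are illustrative defaults, not values from the paper.
    """
    # Centered axes: rows are frequency in octaves (higher row = higher
    # frequency, by assumption); columns are time in seconds.
    f = (np.arange(n_freq) - n_freq // 2) / chans_per_oct
    t = (np.arange(n_time) - n_time // 2) / frame_rate_hz
    F, T = np.meshgrid(f, t, indexing="ij")

    # Gaussian envelope localizes the filter in both dimensions.
    envelope = np.exp(-0.5 * ((F / sigma_oct) ** 2 + (T / sigma_s) ** 2))

    # Assumed phase convention: ridges of constant phase satisfy
    # F = (rate_hz / scale_cpo) * T, so a positive rate tilts upward.
    carrier = np.cos(2.0 * np.pi * (rate_hz * T - scale_cpo * F))
    return envelope * carrier

up = gabor_filter(rate_hz=4.0, scale_cpo=1.0)     # positive temporal modulation
down = gabor_filter(rate_hz=-4.0, scale_cpo=1.0)  # negative temporal modulation

# Flipping the sign of the rate mirrors the filter along the frequency axis.
mirrored = np.allclose(down, up[::-1, :])
```

Under these assumptions, `up` and `down` share the same modulation magnitudes and differ only in tilt direction, which is why the sign of the temporal modulation alone is enough to separate the two halves of each panel in Figure 3.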