Speech Denoising with Auditory Models
Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott
TL;DR
The paper investigates whether deep perceptual losses derived from recognition networks trained on spoken words or environmental sounds can improve speech denoising over traditional waveform reconstruction. It introduces a two-component framework combining recognition-network-based losses with a Wave-U-Net denoiser and evaluates multiple loss families, including cochlear-model losses, using extensive human perceptual tests and standard metrics. The key finding is that deep feature losses improve denoising relative to waveform-loss baselines, but the same benefits are achievable with a standard multi-channel auditory filter bank, suggesting that learned features do not yet provide a unique advantage. The results underscore limitations of current perceptual losses and objective metrics, and point to the need for more expressive transforms or perceptual models to yield robust, generalizable improvements in speech enhancement. The work also highlights that perceptual gains may not be fully captured by traditional metrics, motivating the development of perceptually aligned objective measures.
Abstract
Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform. The development of high-performing neural network sound recognition systems has raised the possibility of using deep feature representations as 'perceptual' losses with which to train denoising systems. We explored their utility by first training deep neural networks to classify either spoken words or environmental sounds from audio. We then trained an audio transform to map noisy speech to an audio waveform that minimized the difference in the deep feature representations between the output audio and the corresponding clean audio. The resulting transforms removed noise substantially better than baseline methods trained to reconstruct clean waveforms, and also outperformed previous methods using deep feature losses. However, a similar benefit was obtained simply by using losses derived from the filter bank inputs to the deep networks. The results show that deep features can guide speech enhancement, but suggest that they do not yet outperform simple alternatives that do not involve learned features.
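The deep feature loss described above can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed random layers stand in for a trained recognition network, the absolute-difference distance and the equal weighting of layers are assumptions, and all function names are hypothetical. The sketch only shows how the loss compares output and clean audio in feature space rather than sample by sample.

```python
import math
import random


def make_random_layers(sizes, seed=0):
    """Build fixed random weight matrices (a stand-in for a trained recognition network)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        layers.append([[rng.gauss(0.0, scale) for _ in range(n_in)]
                       for _ in range(n_out)])
    return layers


def layer_activations(x, layers):
    """Return the ReLU activations of every layer for the input vector x."""
    activations = []
    h = x
    for weights in layers:
        h = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in weights]
        activations.append(h)
    return activations


def feature_loss(output_audio, clean_audio, layers):
    """Sum over layers of the mean absolute activation difference (assumed distance)."""
    out_acts = layer_activations(output_audio, layers)
    clean_acts = layer_activations(clean_audio, layers)
    total = 0.0
    for a, b in zip(out_acts, clean_acts):
        total += sum(abs(p - q) for p, q in zip(a, b)) / len(a)
    return total


def waveform_loss(output_audio, clean_audio):
    """Baseline: mean absolute difference between the two waveforms."""
    return sum(abs(p - q) for p, q in zip(output_audio, clean_audio)) / len(clean_audio)


# Toy signals: a short sinusoid as "clean speech" and a noisy copy of it.
rng = random.Random(1)
clean = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
noisy = [c + rng.gauss(0.0, 0.3) for c in clean]
layers = make_random_layers([64, 32, 16])

print("feature loss (noisy vs clean):", feature_loss(noisy, clean, layers))
print("feature loss (clean vs clean):", feature_loss(clean, clean, layers))
print("waveform loss (noisy vs clean):", waveform_loss(noisy, clean))
```

In a training setup, the denoising transform's output would take the place of `noisy`, and its parameters would be updated to reduce `feature_loss`. Replacing `layer_activations` with the outputs of a fixed auditory filter bank would give the corresponding non-learned loss that the abstract reports as similarly effective.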
