Real Time Speech Enhancement in the Waveform Domain

Alexandre Défossez, Gabriel Synnaeve, Yossi Adi

TL;DR

This work tackles real-time, single-channel speech enhancement by adapting the Demucs model to run causally on the raw waveform on a CPU. The model combines an encoder-decoder with skip connections and a sequence model to output the clean waveform directly. It is trained with a joint objective (an L1 loss on the waveform plus a multi-resolution STFT loss) together with waveform-domain data augmentations. On the Valentini and DNS benchmarks it matches state-of-the-art performance in both objective and subjective evaluations while supporting real-time streaming. The authors also show that enhancing speech before recognition can improve ASR performance under noisy conditions, which matters for communication and accessibility tools.

Abstract

We present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip connections. It is optimized in both the time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noise, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly to the raw waveform, which further improve model performance and generalization. We perform evaluations on several standard benchmarks, using both objective metrics and human judgements. The proposed model matches the state-of-the-art performance of both causal and non-causal methods while working directly on the raw waveform.
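
As an illustration of what optimizing in both the time and frequency domains can look like, the sketch below combines an L1 loss on the waveform with a multi-resolution STFT loss. It is a minimal NumPy sketch, not the authors' implementation: the function names, the choice of spectral-convergence and log-magnitude terms, the equal weighting, and the three (FFT size, hop, window) resolutions are all assumptions made for illustration.

```python
import numpy as np


def stft_magnitude(x, n_fft, hop, win_length):
    """Magnitude STFT of a 1-D signal with a Hann window and no edge padding."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_length] * window for i in range(n_frames)]
    )
    # rfft zero-pads each windowed frame up to n_fft samples.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))


def multi_res_stft_loss(estimate, clean, resolutions, eps=1e-8):
    """Average of (spectral convergence + log-magnitude L1) over resolutions."""
    total = 0.0
    for n_fft, hop, win_length in resolutions:
        mag_est = stft_magnitude(estimate, n_fft, hop, win_length)
        mag_ref = stft_magnitude(clean, n_fft, hop, win_length)
        # Spectral convergence: relative Frobenius-norm error of the magnitudes.
        sc = np.linalg.norm(mag_ref - mag_est) / (np.linalg.norm(mag_ref) + eps)
        # Mean absolute difference between log magnitudes.
        log_mag = np.mean(np.abs(np.log(mag_ref + eps) - np.log(mag_est + eps)))
        total += sc + log_mag
    return total / len(resolutions)


# Hypothetical resolutions as (n_fft, hop, win_length); not taken from the paper.
DEFAULT_RESOLUTIONS = ((512, 50, 240), (1024, 120, 600), (2048, 240, 1200))


def joint_loss(estimate, clean, resolutions=DEFAULT_RESOLUTIONS):
    """Waveform L1 plus multi-resolution STFT loss, equally weighted (assumed)."""
    l1 = np.mean(np.abs(estimate - clean))
    return l1 + multi_res_stft_loss(estimate, clean, resolutions)
```

Both inputs are expected to be 1-D arrays of equal length, at least as long as the largest window. In a training setup the same computation would be written with a differentiable framework so that gradients can flow back to the model; NumPy is used here only to keep the sketch self-contained.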

Paper Structure

This paper contains 13 sections, 2 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Causal Demucs architecture on the left, with a detailed representation of the encoder and decoder layers on the right. The on-the-fly resampling of the input/output by a factor of $U$ is not represented.