Ultra-Low Latency Speech Enhancement - A Comprehensive Study

Haibin Wu; Sebastian Braun

TL;DR

The paper addresses the difficulty of fairly comparing ultra-low-latency speech enhancement methods by proposing a unified framework trained on large-scale data and tested on real-world DNS data. It systematically compares five approaches (symmetric/asymmetric STFT windows, learnable analysis/synthesis transforms, trainable filterbank equalizers, and future-frame prediction) within a consistent base pipeline. Key findings show that learnable windows outperform fixed STFT windows, asymmetric windows mainly help weaker models, and increasing model size can fully compensate for reduced window sizes, whereas future-frame prediction offers limited gains and Mamba struggles at very low latency. The results offer actionable guidance for deploying low-latency speech enhancement in hearables and VoIP, emphasizing model capacity and robust evaluation over specialized latency techniques.

Abstract

Speech enhancement models for hearing assistive devices must meet very low latency requirements, typically below 5 ms. While various low-latency techniques have been proposed, they have not yet been compared in a controlled setup using DNNs. Previous papers differ in task, training data, scripts, and evaluation settings, which makes a fair comparison impossible. Moreover, all methods are tested on small, simulated datasets, making it difficult to fairly assess their performance in real-world conditions, which could undermine the reliability of scientific findings. To address these issues, we comprehensively investigate various low-latency techniques using consistent training on large-scale data and evaluate them with more relevant metrics on real-world data. Specifically, we explore the effectiveness of asymmetric windows, learnable windows, adaptive time-domain filterbanks, and the future-frame prediction technique. Additionally, we examine whether increasing the model size can compensate for the reduced window size, and how the novel Mamba architecture performs in low-latency environments.
