Towards Sub-millisecond Latency Real-Time Speech Enhancement Models on Hearables

Artem Dementyev, Chandan K. A. Reddy, Scott Wisdom, Navin Chatlani, John R. Hershey, Richard F. Lyon

TL;DR

The paper tackles real-time, on-device speech enhancement for hearables at sub-millisecond algorithmic latency under strict memory and power constraints. It introduces Deep FIR, a pipeline in which an LSTM-based tap predictor generates a 128-tap FIR filter every hop, enabling causal, sample-by-sample filtering with very short synthesis windows; a post-inference minimum-phase conversion further reduces group delay. The approach achieves a mean algorithmic latency as low as 0.32 ms and, on a low-power DSP, a mean end-to-end latency of about 3.35 ms, using a 626k-parameter model on 16 kHz audio. Objective metrics show SI-SDR and DNSMOS gains, and subjective ITU-T P.835 tests confirm perceptual improvements. The real-time hardware deployment indicates that the method is practical for hearables: it offers a path toward ultra-low-latency, on-device speech enhancement that improves comfort and usability by minimizing artifacts such as comb filtering while maintaining denoising performance.

Abstract

Low latency models are critical for real-time speech enhancement applications, such as hearing aids and hearables. However, the sub-millisecond latency space for resource-constrained hearables remains underexplored. We demonstrate speech enhancement using a computationally efficient minimum-phase FIR filter, enabling sample-by-sample processing to achieve mean algorithmic latency of 0.32 ms to 1.25 ms. With a single microphone, we observe a mean SI-SDRi of 4.1 dB. The approach shows generalization with a DNSMOS increase of 0.2 on unseen audio recordings. We use a lightweight LSTM-based model of 626k parameters to generate FIR taps. Using a real hardware implementation on a low-power DSP, our system can run with 376 MIPS and a mean end-to-end latency of 3.35 ms. In addition, we provide a comparison with existing low-latency spectral masking techniques. We hope this work will enable a better understanding of latency and can be used to improve the comfort and usability of hearables.
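
As a rough illustration of the minimum-phase conversion mentioned in the abstract, the sketch below uses the standard homomorphic (real-cepstrum folding) method in NumPy. It is not the authors' implementation: the function name `to_minimum_phase`, the `n_fft` oversampling factor, and the toy 8-tap filter are assumptions made for illustration. The idea is to keep the filter's magnitude response while moving its energy toward the first taps, which is what lowers group delay.

```python
import numpy as np

def to_minimum_phase(h, n_fft=None):
    """Return a minimum-phase FIR with approximately the same magnitude as h.

    Hypothetical helper (not from the paper) based on cepstral folding.
    """
    h = np.asarray(h, dtype=float)
    n = len(h)
    if n_fft is None:
        n_fft = 16 * n          # assumption: oversample to limit cepstral aliasing
    n_fft += n_fft % 2          # the folding below assumes an even FFT size

    # Real cepstrum of the log-magnitude response (floored to avoid log(0)).
    log_mag = np.log(np.maximum(np.abs(np.fft.fft(h, n_fft)), 1e-10))
    cepstrum = np.fft.ifft(log_mag).real

    # Fold the anti-causal half onto the causal half.
    fold = np.zeros(n_fft)
    fold[0] = 1.0
    fold[1:n_fft // 2] = 2.0
    fold[n_fft // 2] = 1.0

    # Exponentiate in the frequency domain and return to the time domain.
    h_min = np.fft.ifft(np.exp(np.fft.fft(cepstrum * fold))).real
    return h_min[:n]

# Toy example: a delayed, maximum-phase filter (energy at the last taps).
h = np.zeros(8)
h[6], h[7] = 0.5, 1.0
h_min = to_minimum_phase(h)     # energy moves to the first taps
```

In this toy case the converted filter concentrates its energy in the first two taps while its magnitude response matches the original, so the same spectral shaping is applied with far less delay.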

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Visualization of end-to-end latency. A) Using the proposed Deep FIR filtering, as developed in this paper. B) Using the long-short time window (LSTW) technique. Note that the proposed Deep FIR can achieve lower end-to-end latency.
  • Figure 2: Inference-time causal Deep FIR signal processing diagram, divided into synthesis and analysis. A new FIR filter is estimated every hop.
  • Figure 3: Example data before and after denoising. Mel spectrograms of A) original with white noise and B) speech denoised with a low-latency filter. White rectangles point to problematic areas where the noise was not removed by the model.
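
The per-hop filtering described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function `filter_with_per_hop_taps` is hypothetical, the taps are passed in as an array instead of being predicted by the LSTM, and the toy sizes (8 taps, hop of 16 samples) stand in for the 128-tap filters the paper uses.

```python
import numpy as np

def filter_with_per_hop_taps(x, taps_per_hop, hop):
    """Causal sample-by-sample FIR filtering with a new filter every hop.

    x            : 1-D input signal
    taps_per_hop : array of shape (n_hops, n_taps), one FIR filter per hop
                   (in Deep FIR these would come from the tap predictor)
    hop          : number of samples for which each filter is held
    """
    x = np.asarray(x, dtype=float)
    taps_per_hop = np.asarray(taps_per_hop, dtype=float)
    n_taps = taps_per_hop.shape[1]

    # Zero history before the first sample keeps the filter causal.
    padded = np.concatenate([np.zeros(n_taps - 1), x])
    y = np.empty(len(x))
    for n in range(len(x)):
        # Pick the filter for the current hop (hold the last one if we run out).
        hop_idx = min(n // hop, len(taps_per_hop) - 1)
        # Most recent n_taps samples, newest first: x[n], x[n-1], ...
        recent = padded[n:n + n_taps][::-1]
        y[n] = taps_per_hop[hop_idx] @ recent
    return y

# Toy usage: with the same filter in every hop, the result equals ordinary
# causal convolution.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
h = rng.standard_normal(8)
taps = np.tile(h, (4, 1))
y = filter_with_per_hop_taps(x, taps, hop=16)
```

Because each output sample depends only on the current and past input samples, the filtering step itself adds no block-buffering delay; the remaining delay comes from the filter's group delay, which is what the minimum-phase conversion targets.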