Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds

Hanbin Bae, Pavel Andreev, Azat Saginbaev, Nicholas Babaev, Won-Jun Lee, Hosang Sung, Hoon-Young Cho

TL;DR

This work tackles low-latency, on-device speech enhancement for conversations in noisy environments on TWS earbuds with active noise cancellation (ANC) enabled, targeting an algorithmic latency under 3 ms and minimal computational load. It compares time-domain and frequency-domain architectures, favors a Wave-U-Net+LSTM baseline, and introduces PFPL-based two-stage training to preserve intelligibility while mitigating artifacts. The authors apply a SPDY+OBC pruning pipeline to reach ~90% sparsity, reducing complexity to about 0.21 GMAC while maintaining MOS, and demonstrate hardware-aware optimization on a HiFi4 DSP with on-device feasibility (≈291 MCPS, ~800 kB). The results show substantial improvements in perceptual quality over baseline models at much lower latency and resource usage, suggesting the approach is viable alongside ANC and beamforming in real-world TWS earbuds and can improve in-conversation intelligibility in noisy settings.

Abstract

This paper introduces a speech enhancement solution tailored for on-device use in true wireless stereo (TWS) earbuds. The solution is specifically designed to support conversations in noisy environments with active noise cancellation (ANC) activated. The primary challenges for speech enhancement models in this context are computational complexity, which limits on-device use, and latency, which must stay below 3 ms to preserve a live conversation. To address these issues, we evaluated several crucial design elements, including the network architecture and domain, the design of loss functions, the pruning method, and hardware-specific optimization. As a result, we demonstrate substantial improvements in speech enhancement quality over baseline models while simultaneously reducing computational complexity and algorithmic latency.
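
To make the 3 ms constraint concrete, the following minimal sketch converts a latency budget into a sample count and checks whether a frame-based model fits within it. The sample rates, frame length, and lookahead used here are illustrative assumptions, not values stated in the abstract, and the function names are hypothetical.

```python
def samples_in_budget(latency_ms: float, sample_rate_hz: int) -> float:
    """Number of audio samples spanned by a latency budget."""
    return latency_ms * sample_rate_hz / 1000.0


def algorithmic_latency_ms(frame_len: int, lookahead: int, sample_rate_hz: int) -> float:
    """Latency from buffering one frame plus any lookahead samples.

    This counts only the samples the model must wait for before it can
    emit output; compute time and I/O buffering are not included.
    """
    return 1000.0 * (frame_len + lookahead) / sample_rate_hz


def fits_budget(frame_len: int, lookahead: int, sample_rate_hz: int,
                budget_ms: float = 3.0) -> bool:
    """True if the algorithmic latency is strictly below the budget."""
    return algorithmic_latency_ms(frame_len, lookahead, sample_rate_hz) < budget_ms


if __name__ == "__main__":
    # Assumed sample rates, chosen only for illustration.
    for sr in (16_000, 48_000):
        print(sr, "Hz:", samples_in_budget(3.0, sr), "samples in 3 ms")

    # Hypothetical configuration: 32-sample frames, no lookahead, 16 kHz.
    print(algorithmic_latency_ms(32, 0, 16_000), "ms")
    print(fits_budget(32, 0, 16_000))
```

Under these assumptions, a 3 ms budget leaves room for only a few dozen samples at 16 kHz, which is why frame size and lookahead dominate the design space for this kind of model.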

Paper Structure

This paper contains 15 sections, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Architecture of the baseline Wave-U-Net + LSTM model.
  • Figure 2: Examples of speech denoising performance.