SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression
Zhihang Sun, Andong Li, Rilin Chen, Hao Zhang, Meng Yu, Yi Zhou, Dong Yu
TL;DR
SMRU addresses the need for deployment-friendly acoustic echo cancellation and noise suppression by combining a split-and-merge strategy in the frequency domain with a recurrent UNet that uses variable frame-rate processing. The approach supports flexible computational budgets (from tens of millions to several billions of MACs per second) while maintaining or improving performance across edge and cloud scenarios, aided by cross-scale skip connections and an inter-band MLP shuffler. A lightweight post-processing stage and a composite loss (MAE plus echo-aware and VAD-guided terms) further enhance near-end speech preservation and echo suppression. Experimental results on synthetic and blind AEC datasets show competitive or better performance relative to strong baselines, with robust generalization and practical efficiency for real-time applications.
Abstract
The proliferation of deep neural networks has spurred rapid progress in acoustic echo cancellation and noise suppression, and many prior methods have been proposed that yield promising performance. Nevertheless, they rarely consider deployment generality across different processing scenarios, such as edge devices and cloud processing. To this end, this paper proposes a general model, termed SMRU, to cover different application scenarios. The novelty is twofold. First, a multi-scale band split layer and a band merge layer are proposed to effectively fuse local frequency bands for lower-complexity modeling. Second, mimicking the multi-resolution feature modeling of the classical UNet structure, a novel recurrent-dominated UNet is devised. It consists of multiple variable frame rate blocks, each of which comprises causal time down-/up-sampling layers with varying compression ratios and a dual-path structure for inter- and intra-band modeling. The model is configured from 50 M/s to 6.8 G/s in terms of MACs, and the experimental results show that the proposed approach yields competitive or even better performance than existing baselines and has the potential to adapt to more general scenarios with varying complexity requirements.
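The band split/merge and causal time down-/up-sampling ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the per-band linear projections, the trailing-window mean pooling, the hold-style upsampling, and all names and sizes (`band_edges`, `band_split`, `band_merge`, `causal_downsample`, `causal_upsample`, the band widths) are assumptions chosen only to make the data flow concrete. The recurrent dual-path modeling is omitted.

```python
import numpy as np

def band_edges(num_bins, widths):
    """Turn a list of band widths into (start, stop) bin ranges (hypothetical helper)."""
    edges, start = [], 0
    for w in widths:
        edges.append((start, start + w))
        start += w
    if start != num_bins:
        raise ValueError("band widths must sum to the number of frequency bins")
    return edges

def band_split(spec, edges, weights):
    """Project each band of spec (T, F) to D features; returns (T, K, D).
    A plain linear projection per band is an assumption of this sketch."""
    return np.stack([spec[:, a:b] @ w for (a, b), w in zip(edges, weights)], axis=1)

def band_merge(feats, weights):
    """Project each band feature back to its bins and concatenate; returns (T, F)."""
    return np.concatenate([feats[:, k] @ w for k, w in enumerate(weights)], axis=1)

def causal_downsample(x, ratio):
    """Reduce the frame rate by `ratio` along axis 0 using only past and current frames.
    The low-rate frame emitted at time t averages frames t-ratio+1 .. t
    (frames before the start are treated as zeros)."""
    T = x.shape[0]
    pad = np.concatenate([np.zeros((ratio - 1,) + x.shape[1:]), x], axis=0)
    return np.stack([pad[t:t + ratio].mean(axis=0) for t in range(0, T, ratio)], axis=0)

def causal_upsample(y, ratio, num_frames):
    """Hold each low-rate frame for `ratio` steps, then trim to num_frames."""
    return np.repeat(y, ratio, axis=0)[:num_frames]

# Toy sizes, chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
T, F, D, ratio = 10, 16, 4, 4
edges = band_edges(F, [2, 2, 4, 8])            # narrower bands at low frequencies
w_in = [rng.standard_normal((b - a, D)) for a, b in edges]
w_out = [rng.standard_normal((D, b - a)) for a, b in edges]

spec = rng.standard_normal((T, F))             # stand-in for spectral features
feats = band_split(spec, edges, w_in)          # (T, K, D)
low = causal_downsample(feats, ratio)          # fewer frames, same bands
out = band_merge(causal_upsample(low, ratio, T), w_out)
print(feats.shape, low.shape, out.shape)
```

In this sketch, a larger `ratio` means any sequence model placed between the down- and up-sampling steps runs on fewer frames, which is one way a compute budget could be traded against temporal resolution.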
