SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression

Zhihang Sun, Andong Li, Rilin Chen, Hao Zhang, Meng Yu, Yi Zhou, Dong Yu

TL;DR

SMRU addresses the need for deployment-friendly acoustic echo cancellation and noise suppression by combining a split-and-merge strategy in the frequency domain with a recurrent UNet that uses variable frame-rate processing. The model can be configured across a wide range of computational budgets (from 50 M to 6.8 G MACs per second) while maintaining or improving performance across edge and cloud scenarios, aided by cross-scale skip connections and an inter-band MLP shuffler. A lightweight post-processing stage and a composite loss (MAE plus echo-aware and VAD-guided terms) further improve near-end speech preservation and echo suppression. Experimental results on synthetic and blind AEC datasets show competitive gains over strong baselines, with robust generalization and practical efficiency for real-time applications.

Abstract

The proliferation of deep neural networks has driven rapid progress in acoustic echo cancellation and noise suppression, and many prior methods report promising performance. Nevertheless, they rarely consider deployment generality across different processing scenarios, such as edge devices and cloud processing. To this end, this paper proposes a general model, termed SMRU, to cover different application scenarios. The novelty is twofold. First, a multi-scale band split layer and a band merge layer are proposed to effectively fuse local frequency bands for lower-complexity modeling. Second, mimicking the multi-resolution feature modeling of the classical UNet structure, a novel recurrent-dominated UNet is devised. It consists of multiple variable frame rate blocks, each of which comprises causal time down-/up-sampling layers with varying compression ratios and a dual-path structure for inter- and intra-band modeling. The model is configured from 50 M/s to 6.8 G/s in terms of MACs, and experimental results show that the proposed approach yields competitive or even better performance than existing baselines and has the potential to adapt to more general scenarios with varying complexity requirements.
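
To make the split-and-merge idea concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not give the band layout or the layer internals, so the `BAND_LAYOUT` table, the function names, and the fixed averaging and broadcasting operations are all illustrative assumptions; the layers in SMRU are learned. The sketch only shows why grouping frequency bins into multi-scale bands shrinks the feature dimension that later blocks must model.

```python
import numpy as np

# Hypothetical layout (not from the paper): (start_bin, stop_bin, bins_per_band).
# Narrow bands at low frequencies, wider bands at high frequencies.
BAND_LAYOUT = [(0, 32, 2), (32, 96, 4), (96, 256, 8)]


def band_edges(layout):
    """Return the (lo, hi) bin range of every band in the layout."""
    edges = []
    for start, stop, width in layout:
        for lo in range(start, stop, width):
            edges.append((lo, min(lo + width, stop)))
    return edges


def band_split(spec, edges):
    """Average the bins inside each band: (frames, bins) -> (frames, bands).

    Assumption: a fixed mean stands in for the learned split layer.
    """
    return np.stack([spec[:, lo:hi].mean(axis=1) for lo, hi in edges], axis=1)


def band_merge(bands, edges, n_bins):
    """Copy each band value back to its bins: (frames, bands) -> (frames, bins).

    Assumption: a fixed broadcast stands in for the learned merge layer.
    """
    out = np.zeros((bands.shape[0], n_bins), dtype=bands.dtype)
    for k, (lo, hi) in enumerate(edges):
        out[:, lo:hi] = bands[:, k:k + 1]
    return out


edges = band_edges(BAND_LAYOUT)
spec = np.random.default_rng(0).random((10, 256))  # stand-in magnitude spectrogram
bands = band_split(spec, edges)                    # 256 bins reduced to 52 bands
restored = band_merge(bands, edges, n_bins=256)    # back to the full bin resolution
```

With this assumed layout, the 256 frequency bins collapse to 52 bands, so any per-frame sequence model placed between the split and the merge operates on roughly a fifth of the original frequency resolution.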

Paper Structure (20 sections, 14 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Overview diagram of the proposed hybrid AEC system.
  • Figure 2: Architecture of the proposed SMRU. Different modules are shown in different colors for clarity. (a) Overall diagram of the proposed SMRU. (b) Detailed structure of the multi-scale band split layer. (c) Detailed structure of the band merge layer. (d) Detailed structure of the variable frame rate block. (e) Detailed structure of the inter-band MLP.
  • Figure 3: AECMOS metrics of the blind test set under the DT scenario.
  • Figure 4: Spectrum visualizations of an example. (a) Mixture audio. (b) Target near-end speech. (c) Estimated spectrum processed by DeepFilterNet. (d) Estimated spectrum processed by SMRU-S.