FADI-AEC: Fast Score Based Diffusion Model Guided by Far-end Signal for Acoustic Echo Cancellation

Yang Liu, Li Wan, Yun Li, Yiteng Huang, Ming Sun, James Luan, Yangyang Shi, Xin Lei

TL;DR

The paper tackles acoustic echo cancellation by introducing diffusion-based approaches that provide probabilistic, high-quality clean speech estimates. It proposes DI-AEC and its efficient variant FADI-AEC, which leverages far-end guided noise and a fast, per-frame score computation to reduce computational load for edge devices. A far-end conditioned score model and a fast score formulation enable stable reconstruction with markedly lower latency while preserving or improving quality metrics like ERLE and PESQ. Evaluations on the ICASSP 2023 deep echo cancellation dataset show competitive or superior performance to state-of-the-art methods, highlighting practical impact for real-time, resource-constrained scenarios.

Abstract

Despite the potential of diffusion models in speech enhancement, their deployment in Acoustic Echo Cancellation (AEC) has been limited. In this paper, we propose DI-AEC, a pioneering diffusion-based stochastic regeneration approach dedicated to AEC. Further, we propose FADI-AEC, a fast score-based diffusion AEC framework that reduces computational demands, making it favorable for edge devices. It stands out by running the score model only once per frame, which significantly improves processing efficiency. In addition, we introduce a novel noise generation technique that uses far-end signals, incorporating both far-end and near-end signals to refine the score model's accuracy. We test our proposed method on the ICASSP 2023 Microsoft deep echo cancellation challenge evaluation dataset, where it outperforms some of the end-to-end methods and other diffusion-based echo cancellation methods.

Paper Structure (13 sections, 9 equations, 1 figure, 2 tables)

This paper contains 13 sections, 9 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: FADI-AEC pipeline. The predictive filter is first used to generate a predicted estimate $\hat{\mathbf{s}}(n)$ from the mic signal $\mathbf{h}(n)$. Diffusion-based generation $G_{\phi}$ is then performed by adding Gaussian noise guided by the far-end signal $\mathbf{x}(n)$ and solving the reverse diffusion SDE. The estimated near-end speech is $\tilde{\mathbf{s}}(n)$, which is used in the score function for the next frame.
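
The per-frame control flow described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: `predictive_filter`, `far_end_noise_scale`, and `toy_score` are hypothetical stand-ins for the trained predictive filter, the far-end guided noise rule, and the learned score model, and the frame length and noise scale are assumed values. A single denoising step based on Tweedie's formula is swapped in for the reverse diffusion SDE solver, only to show that the score function is evaluated once per frame and that each frame's estimate is carried to the next frame.

```python
import numpy as np

def predictive_filter(mic, far):
    """Hypothetical stand-in for the predictive filter (a trained model in practice)."""
    return mic - 0.5 * far

def far_end_noise_scale(far, sigma=0.1):
    """Assumed form of far-end guidance: noise std grows with far-end magnitude."""
    return sigma * (1.0 + np.abs(far))

def toy_score(x_t, predicted, prev_estimate, noise_std):
    """Hypothetical stand-in for the learned score network.

    Returns the exact score of a Gaussian centred on `predicted`. A trained
    network would also condition on `prev_estimate`, which is ignored here.
    """
    return -(x_t - predicted) / noise_std**2

def process_stream(mic, far, score_fn=toy_score, frame_len=160, seed=0):
    """Process a signal frame by frame, calling the score function once per frame."""
    rng = np.random.default_rng(seed)
    n_frames = len(mic) // frame_len
    out = np.zeros(n_frames * frame_len)
    prev_estimate = np.zeros(frame_len)  # no estimate exists before the first frame
    score_calls = 0
    for i in range(n_frames):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        predicted = predictive_filter(mic[sl], far[sl])
        # Perturb the predicted estimate with far-end guided Gaussian noise.
        noise_std = far_end_noise_scale(far[sl])
        x_t = predicted + noise_std * rng.standard_normal(frame_len)
        # One score evaluation per frame.
        score = score_fn(x_t, predicted, prev_estimate, noise_std)
        score_calls += 1
        # Single denoising step (Tweedie's formula) in place of an SDE solver.
        estimate = x_t + noise_std**2 * score
        out[sl] = estimate
        prev_estimate = estimate  # reused by the score function in the next frame
    return out, score_calls

rng = np.random.default_rng(1)
mic = rng.standard_normal(800)
far = rng.standard_normal(800)
out, calls = process_stream(mic, far)
print(out.shape, calls)
```

With the analytic toy score, the denoising step simply recovers the predictive filter's output, so the sketch demonstrates only the data flow and the once-per-frame score evaluation, not any echo suppression quality.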