Personalized Speech Enhancement Without a Separate Speaker Embedding Model

Tanel Pärnamaa, Ando Saabas

TL;DR

The paper reduces the overhead of personalized speech enhancement by replacing the separate speaker-embedding model with the internal representation of a single model, PVQE-S. It adapts DeepVQE to fuse speaker information and shows that an embedding extracted from the model's internal temporal states performs on par with or better than two-stage approaches in noise suppression and echo cancellation. The approach yields state-of-the-art results on DNS data and surpasses the ICASSP 2023 Deep Noise Suppression (DNS) Challenge winner by 0.15 Mean Opinion Score (MOS) in subjective tests, with particularly strong gains for small, real-time models. This single-model, auto-enrollable framework reduces training and deployment complexity and is well suited to edge devices in teleconferencing scenarios.

Abstract

Personalized speech enhancement (PSE) models can improve the audio quality of teleconferencing systems by adapting to the characteristics of a speaker's voice. However, most existing methods require a separate speaker embedding model to extract a vector representation of the speaker from enrollment audio, which adds complexity to the training and deployment process. We propose to use the internal representation of the PSE model itself as the speaker embedding, thereby avoiding the need for a separate model. We show that our approach performs equally well or better than the standard method of using a pre-trained speaker embedding model on noise suppression and echo cancellation tasks. Moreover, our approach surpasses the ICASSP 2023 Deep Noise Suppression Challenge winner by 0.15 in Mean Opinion Score.

Paper Structure (14 sections, 1 figure, 2 tables)

This paper contains 14 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Model architecture with speaker information fusion. The figure shows how the speaker embedding is concatenated with the encoder features, the details of the temporal block, and the location of the internal embedding that we use to characterise speakers.
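
The fusion described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the mean pooling over time, the function names (pool_over_time, fuse), and the toy dimensions are all assumptions chosen for illustration. The caption only states that the speaker embedding is concatenated with the encoder features and that the embedding is taken from an internal location in the model.

```python
from typing import List

Frame = List[float]  # one time frame of feature values


def pool_over_time(states: List[Frame]) -> Frame:
    """Average per-frame internal states into one fixed-size vector.

    Assumption: mean pooling is used here only as a simple stand-in for
    however the model summarises enrollment audio into an embedding.
    """
    if not states:
        raise ValueError("need at least one frame")
    n = len(states)
    dim = len(states[0])
    return [sum(frame[i] for frame in states) / n for i in range(dim)]


def fuse(encoder_features: List[Frame], embedding: Frame) -> List[Frame]:
    """Append the same embedding to every time frame (feature-axis concatenation)."""
    return [frame + embedding for frame in encoder_features]


# Hypothetical internal states produced from enrollment audio (2 frames, 2 dims).
enroll_states = [[1.0, 3.0], [3.0, 5.0]]
embedding = pool_over_time(enroll_states)

# Hypothetical encoder features for the noisy input (3 frames, 3 dims).
encoder_features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
fused = fuse(encoder_features, embedding)
```

Each fused frame has the encoder feature dimension plus the embedding dimension, so the layers after the fusion point see the same speaker vector at every time step.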