Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses

Suhita Ghosh, Tim Thiele, Frederic Lorbeer, Frank Dreyer, Sebastian Stober

TL;DR

A VQVAE-based model enhanced with the authors' perception-driven losses surpasses the vanilla model in naturalness, intelligibility, and prosody while maintaining speaker anonymity.

Abstract

The increasing use of cloud-based speech assistants has heightened the need for effective speech anonymization, which aims to obscure a speaker's identity while retaining critical information for subsequent tasks. One approach to achieving this is through voice conversion. While existing methods often emphasize complex architectures and training techniques, our research underscores the importance of loss functions inspired by the human auditory system. Our proposed loss functions are model-agnostic, incorporating handcrafted and deep learning-based features to effectively capture quality representations. Through objective and subjective evaluations, we demonstrate that a VQVAE-based model, enhanced with our perception-driven losses, surpasses the vanilla model in terms of naturalness, intelligibility, and prosody while maintaining speaker anonymity. These improvements are consistently observed across various datasets, languages, target speakers, and genders.

Paper Structure

This paper contains 12 sections, 4 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: User study results for different scenarios and all conversions (All conv.). The Speaker Similarity plot indicates the similarity between the source and converted utterances (lower is better). The MOS plot shows the naturalness ratings from the user study (higher is better). The Prosody and Intelligibility Votes plots show the percentage of votes each model received. The mean MOS of the original files is 3.54.