Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech

Anastasia Avdeeva; Aleksei Gusev

TL;DR

The paper tackles zero-shot any-to-any voice conversion (VC) for both whispered and regular speech, aiming to maximize speaker similarity while running in real time. It introduces SpeakerVC, a lightweight system based on StyleTTS2, augmented with a HuBERT-based encoder that produces discrete units and a cosine speaker loss $L_{spk} = \frac{1}{N}\left(1 - \frac{X \cdot Y}{\|X\| \|Y\|}\right)$, and trained on expanded datasets that include whispered data. Three decoder variants (Tacotron2, FastSpeech2, SpeakerVC) are explored; SpeakerVC is further enhanced with an Acoustic Style Encoder and ECAPA-TDNN speaker embeddings, which yields strong EER and SIM-o results while keeping streaming latency around 0.8 seconds. Across objective and subjective evaluations (WER, EER, SIM-o, SMOS), the proposed methods, especially SpeakerVC, outperform several SOTA TTS/VC baselines in any-to-any scenarios, including whispered-to-speech tasks, and remain robust in streaming deployment. The work highlights that adding a dedicated speaker loss and scaling the speaker pool during training significantly improves speaker identity transfer, contributing to practical, real-time, whisper-aware VC.
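To make the cosine speaker loss concrete, the following is a minimal sketch in plain Python. It assumes, since the summary does not say, that $X$ and $Y$ are speaker embeddings of the generated and reference utterances, and that $N$ is the number of embedding pairs being averaged over. The function names `cosine_similarity` and `speaker_loss` and the `eps` guard are illustrative and not taken from the paper.

```python
import math


def cosine_similarity(x, y, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # eps avoids division by zero for all-zero embeddings (an assumption,
    # not something the paper specifies).
    return dot / max(norm_x * norm_y, eps)


def speaker_loss(generated, reference):
    """Mean cosine distance (1 - cosine similarity) over N embedding pairs.

    Assumption: `generated` and `reference` are lists of N speaker
    embeddings, e.g. extracted by a speaker encoder from generated and
    ground-truth audio.
    """
    if len(generated) != len(reference) or not generated:
        raise ValueError("expected two non-empty lists of equal length")
    n = len(generated)
    total = sum(1.0 - cosine_similarity(x, y)
                for x, y in zip(generated, reference))
    return total / n


if __name__ == "__main__":
    same = speaker_loss([[1.0, 0.0]], [[1.0, 0.0]])        # identical direction
    orthogonal = speaker_loss([[1.0, 0.0]], [[0.0, 1.0]])  # unrelated direction
    print(same, orthogonal)
```

Under these assumptions the loss is 0 when the two embeddings point in the same direction and grows to 2 when they point in opposite directions, so minimizing it pulls the generated speech toward the target speaker in embedding space. The same cosine similarity is also the usual basis for similarity scores such as SIM-o.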

Abstract

Zero-shot voice conversion aims to transfer the voice of a source speaker to that of a speaker unseen during training, while preserving the content information. Although various methods have been proposed to reconstruct speaker information in generated speech, there is still room for improvement in achieving high similarity between generated and ground truth recordings. Furthermore, zero-shot voice conversion for speech in specific domains, such as whispered speech, remains an unexplored area. To address this problem, we propose SpeakerVC, a model that can effectively perform zero-shot speech conversion in both voiced and whispered domains, while being lightweight and capable of running in streaming mode without significant quality degradation. In addition, we explore methods to improve the quality of speaker identity transfer and demonstrate their effectiveness for a variety of voice conversion systems.

Paper Structure (12 sections, 1 equation, 2 figures, 5 tables)

Introduction
Related works
Datasets
Systems description
Encoder
Speaker loss
Tacotron-based decoder
FastSpeech-based decoder
SpeakerVC
Evaluation metrics
Experiments
Discussion

Figures (2)

Figure 1: Proposed FastSpeech2-based VC system.
Figure 2: Proposed SpeakerVC system.
