Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech
Anastasia Avdeeva, Aleksei Gusev
TL;DR
The paper tackles zero-shot any-to-any VC for both whispered and regular speech, aiming to maximize speaker similarity in real time. It introduces SpeakerVC, a lightweight system based on StyleTTS2, augmented with a HuBERT-based encoder producing discrete units and a cosine speaker loss $L_{spk} = \frac{1}{N}\left(1 - \frac{X \cdot Y}{\|X\| \|Y\|}\right)$, trained on expanded datasets with whispered data. Three decoder variants (Tacotron2, FastSpeech2, SpeakerVC) are explored, with SpeakerVC further enhanced by an Acoustic Style Encoder and ECAPA-TDNN speaker embeddings, enabling strong EER and SIM-o performance while keeping streaming latency around 0.8 seconds. Across objective and subjective evaluations (WER, EER, SIM-o, SMOS), the proposed methods, especially SpeakerVC, outperform several SOTA TTS/VC baselines in any-to-any scenarios, including whispered-to-speech tasks, and demonstrate robustness for streaming deployment. The work highlights that incorporating a dedicated speaker loss and scaling the speaker pool during training significantly improves speaker identity transfer, contributing to practical, real-time whisper-aware VC capabilities.
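As a rough illustration of the cosine speaker loss, the sketch below computes one minus the cosine similarity between paired embeddings and averages over the pairs. It rests on assumptions the summary does not confirm: that $X$ and $Y$ are speaker embeddings of the generated and target speech, and that $N$ is the number of embedding pairs (for example, the batch size). The function names are hypothetical and plain Python lists stand in for tensors.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def speaker_loss(generated, reference):
    """Mean of (1 - cosine similarity) over N embedding pairs.

    Assumption: N is the number of pairs; the source formula does not
    spell out how the 1/N factor is applied.
    """
    if len(generated) != len(reference) or not generated:
        raise ValueError("expected two non-empty lists of equal length")
    total = sum(1.0 - cosine_similarity(x, y) for x, y in zip(generated, reference))
    return total / len(generated)


# Toy 2-D embeddings: one identical pair and one orthogonal pair.
generated = [[1.0, 0.0], [1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]
loss = speaker_loss(generated, reference)
```

Under this reading the loss is 0 when every generated embedding points in the same direction as its target and grows toward 2 as the embeddings point in opposite directions, so minimizing it pulls the generated speech toward the target speaker in embedding space.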
Abstract
Zero-shot voice conversion aims to transfer the voice of a source speaker to that of a speaker unseen during training, while preserving the content information. Although various methods have been proposed to reconstruct speaker information in generated speech, there is still room for improvement in achieving high similarity between generated and ground truth recordings. Furthermore, zero-shot voice conversion for speech in specific domains, such as whispered speech, remains an unexplored area. To address these problems, we propose SpeakerVC, a model that can effectively perform zero-shot speech conversion in both voiced and whispered domains, while being lightweight and capable of running in streaming mode without significant quality degradation. In addition, we explore methods to improve the quality of speaker identity transfer and demonstrate their effectiveness for a variety of voice conversion systems.
