Voice Adaptation for Swiss German
Samuel Stucki, Jan Deriu, Mark Cieliebak
TL;DR
This work tackles voice adaptation for Swiss German dialects by building a large, weakly labeled SRG corpus from Swiss podcasts and fine-tuning XTTS-v2 to synthesize Swiss German speech from Standard German text. It introduces a complete pipeline (VAD/diarization, Whisper-Large-V3 transcription, and a phoneme-based dialect classifier) that yields ~4,979 hours (1.7M samples) of labeled data, on which three XTTS-v2 variants are trained. Automated and human evaluations show that models trained on SRG data can render dialects with high intelligibility and speaker similarity, although distribution mismatch can degrade performance on short texts and Zurich remains a challenging class for dialect labeling. The results demonstrate a viable path to voice adaptation for underrepresented languages and point to future gains from improved sentence-based segmentation and dialect-aware training.
Abstract
This work investigates the performance of voice adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. To this end, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5,000 hours of weakly labeled training material. We fine-tune the XTTS-v2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8. Our work is a step towards adapting voice cloning technology to underrepresented languages.
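To make the reported metrics concrete, the sketch below shows how mean opinion scores of this kind are commonly aggregated. It assumes the conventional scales (CMOS: system compared against a reference on -3 to +3, so a negative value means the system was rated slightly below the reference; SMOS: speaker similarity on 1 to 5). The exact rating protocol used in this work is not stated in the abstract, and the helper `mean_opinion_score` and all ratings below are hypothetical illustrations, not data from the study.

```python
from statistics import mean


def mean_opinion_score(ratings, low, high):
    """Average listener ratings after checking they lie on the expected scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} outside [{low}, {high}]")
    return mean(ratings)


# Hypothetical ratings, for illustration only (not data from the paper).
cmos_ratings = [-1, 0, 0, -1, 1, 0, -1, 0]  # system vs. reference, assumed scale -3..+3
smos_ratings = [4, 4, 3, 5, 4, 3, 4, 4]     # speaker similarity, assumed scale 1..5

cmos = mean_opinion_score(cmos_ratings, -3, 3)
smos = mean_opinion_score(smos_ratings, 1, 5)
print(f"CMOS = {cmos:+.2f}, SMOS = {smos:.2f}")
```

Under these assumed scales, a CMOS close to zero indicates that listeners judged the synthesized speech nearly on par with the reference, while a higher SMOS indicates closer speaker similarity.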
