Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari
TL;DR
This work defines cross-dialect TTS (CD-TTS) for pitch-accent languages and introduces a three-module architecture: a backbone TTS model, a VQ-VAE-based reference encoder that extracts phoneme-level accent latent variables (ALVs) from reference speech, and an ALV predictor built on a dialect-ID-conditioned multi-dialect phoneme-level BERT (MD-PL-BERT) pre-trained on a multi-dialect text corpus augmented with LLM-generated translations. Training proceeds in two stages: the backbone TTS is trained first, conditioned on ALVs from the reference encoder, and the ALV predictor is trained afterwards. At synthesis time, the model predicts dialect-specific ALVs from text to control pitch accent, and it can also transfer pitch accent from an arbitrary speaker via reference speech. Experiments on the Osaka and Tokyo dialects of Japanese show improved dialectality in CD-TTS without sacrificing intra-dialect naturalness, with BN features outperforming F0 for ALV extraction and LLM-based data augmentation enhancing dialect translation quality. Overall, the approach enables more natural, regionally localized TTS without relying on expensive accent dictionaries, and it may extend to further dialects and languages.
Abstract
We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize learned speakers' voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that naturally communicate with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from a text conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.
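The two ways of obtaining ALVs described above (extraction from reference speech, or prediction from text for a target dialect) can be illustrated with a small sketch. This is not the authors' implementation. It assumes that ALVs are discrete codebook indices (as a VQ-VAE-based encoder suggests), that phoneme-level features are already available as plain vectors, and that the predictor is any callable returning one index per phoneme. The names `quantize`, `alvs_from_reference`, `alvs_from_text`, and `toy_predictor` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def quantize(feature: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry nearest to `feature` (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def alvs_from_reference(phoneme_feats: List[Vector],
                        codebook: List[Vector]) -> List[int]:
    """Reference path (assumed): quantize one feature vector per phoneme."""
    return [quantize(f, codebook) for f in phoneme_feats]


def alvs_from_text(phonemes: List[str],
                   dialect_id: str,
                   predictor: Callable[[List[str], str], List[int]]) -> List[int]:
    """Predictor path (assumed): one ALV index per phoneme, given a dialect ID."""
    alvs = list(predictor(phonemes, dialect_id))
    if len(alvs) != len(phonemes):
        raise ValueError("predictor must return exactly one ALV per phoneme")
    return alvs


def toy_predictor(phonemes: List[str], dialect_id: str) -> List[int]:
    """Stand-in for a trained ALV predictor; the outputs carry no linguistic meaning."""
    offset = 1 if dialect_id == "osaka" else 0
    return [(i + offset) % 2 for i in range(len(phonemes))]


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (1.0, 1.0)]             # toy two-entry codebook
    feats = [(0.1, -0.2), (0.9, 1.2), (0.4, 0.4)]   # toy phoneme-level features
    print(alvs_from_reference(feats, codebook))
    print(alvs_from_text(["a", "m", "e"], "osaka", toy_predictor))
```

In either path the result is a sequence of ALVs aligned one-to-one with the phonemes, which is what the backbone TTS would be conditioned on. The length check in `alvs_from_text` makes that alignment assumption explicit.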
