Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT

Kazuki Yamauchi; Yuki Saito; Hiroshi Saruwatari

TL;DR

This work defines cross-dialect TTS (CD-TTS) for pitch-accent languages and introduces an architecture with three modules: a backbone TTS model; a VQ-VAE-based reference encoder that extracts phoneme-level accent latent variables (ALVs) from reference speech; and an ALV predictor built on a dialect-ID-conditioned MD-PL-BERT, pre-trained on a multi-dialect text corpus augmented with LLM-driven translations. Training proceeds in two stages: the reference encoder and backbone TTS model are trained first, then the ALV predictor. At inference, the model predicts dialect-specific ALVs from text to control pitch accent in the synthesized speech, and it can also transfer pitch accent from an arbitrary speaker via reference speech. Experiments on the Osaka and Tokyo dialects of Japanese show improved dialectality in CD-TTS without sacrificing intra-dialect naturalness, with BN features outperforming F0 for ALV extraction and LLM-based data augmentation improving dialect translation quality. Overall, the approach enables more natural, regionally localized TTS without relying on expensive accent dictionaries, and may extend to further dialects and languages.

Abstract

We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize learned speakers' voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that naturally communicate with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from a text conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.
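The extraction of discrete phoneme-level ALVs by a VQ-VAE-based reference encoder can be illustrated with a nearest-codebook lookup, which is the quantization step common to VQ-VAE models. The following is a minimal sketch, not the authors' implementation: the function name `quantize_to_alvs`, the one-dimensional features, and the codebook values are all hypothetical, and the codebook size of four is only an assumption suggested by the ALV values (0 to 3) listed in the Figure 4 caption.

```python
from typing import List, Sequence


def quantize_to_alvs(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each phoneme-level feature vector to the index of its nearest codebook entry.

    The returned indices play the role of discrete ALVs in this sketch.
    """
    alvs = []
    for vec in features:
        # Squared Euclidean distance from this vector to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        alvs.append(dists.index(min(dists)))
    return alvs


# Hypothetical 4-entry codebook with 1-dimensional entries (illustration only).
codebook = [[-1.0], [0.0], [1.0], [2.0]]
# Hypothetical encoder outputs, one vector per phoneme.
phoneme_features = [[-0.9], [0.1], [1.8], [0.6]]

alvs = quantize_to_alvs(phoneme_features, codebook)
print(alvs)
```

In an actual VQ-VAE the codebook is learned jointly with the encoder; this sketch shows only the lookup that turns continuous phoneme-level features into discrete indices.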

Paper Structure (16 sections, 1 equation, 4 figures, 6 tables)

This paper contains 16 sections, 1 equation, 4 figures, 6 tables.

Introduction
Related Work
Prosody transfer
Data-driven pitch-accent modeling
Self-supervised pre-training on text data for TTS
Problems of conventional methods for dialect TTS
Method
Reference encoder
ALV predictor incorporating MD-PL-BERT
Training and inference
Experiments
Experimental conditions
Evaluations
Results and discussion
Ablation study
...and 1 more section

Figures (4)

Figure 1: Flowchart of typical Japanese TTS model.
Figure 2: Overview of our proposed TTS model.
Figure 3: The architecture of our proposed model, consisting of a reference encoder and an ALV predictor. In the first training stage, the reference encoder and backbone TTS model are trained. In the second training stage, the ALV predictor is trained.
Figure 4: Violin plot of the logarithmic fundamental frequency ($\log \mathrm{F0}$) aggregated by ALV value ($0, 1, 2,$ or $3$).
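The aggregation described in the Figure 4 caption can be sketched as a simple group-by: log F0 values are collected per ALV value, and each group would then feed one violin of the plot. This is an illustrative sketch under stated assumptions, not the authors' analysis code: the function name `group_log_f0_by_alv`, the sample F0 values, and the convention of treating non-positive F0 as unvoiced are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def group_log_f0_by_alv(
    f0_hz: Sequence[float],
    alvs: Sequence[int],
) -> Dict[int, List[float]]:
    """Collect log F0 values per ALV value, skipping unvoiced entries."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for f0, alv in zip(f0_hz, alvs):
        # Assumption: F0 <= 0 marks an unvoiced segment, for which log F0 is undefined.
        if f0 > 0:
            groups[alv].append(math.log(f0))
    return dict(groups)


# Hypothetical phoneme-level F0 values (Hz) and their ALV assignments.
f0_hz = [100.0, 0.0, 200.0, 150.0, 220.0]
alvs = [0, 0, 3, 1, 3]

groups = group_log_f0_by_alv(f0_hz, alvs)
for alv in sorted(groups):
    print(alv, groups[alv])
```

Each list in `groups` corresponds to the distribution that one violin would display for that ALV value.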
