Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis

Xuehao Zhou; Mingyang Zhang; Yi Zhou; Zhizheng Wu; Haizhou Li

TL;DR

The work tackles multi-speaker multi-accent TTS by jointly disentangling speaker and accent information and modeling accents at both the utterance and phoneme levels. It introduces a speaker-independent global accent model (SIGAM, utterance level) and a speaker-independent local accent model (SILAM, phoneme level), together with a local accent prediction model (LAPM) and a two-stage training regime that enables inference without reference speech. Experiments on the L2-ARCTIC corpus show improvements in objective metrics such as mel-cepstral distortion (MCD), F0 accuracy, and duration, as well as in subjective naturalness and accent similarity, and ablations validate each component's contribution. The approach enables flexible, high-quality multi-speaker multi-accent TTS; the authors note a trade-off in speaker similarity for cross-accent generation and suggest zero-shot and architecture-agnostic extensions as future work.

Abstract

Generating speech across different accents while preserving speaker identity is crucial for various real-world applications. However, accurately and independently modeling both speaker and accent characteristics in text-to-speech (TTS) systems is challenging due to the complex variations of accents and the inherent entanglement between speaker and accent identities. In this paper, we propose a novel approach for multi-speaker multi-accent TTS synthesis that aims to synthesize speech for multiple speakers, each with various accents. Our approach employs a multi-scale accent modeling strategy to address accent variations on different levels. Specifically, we introduce both global (utterance level) and local (phoneme level) accent modeling to capture overall accent characteristics within an utterance and fine-grained accent variations across phonemes, respectively. To enable independent control of speakers and accents, we use the speaker embedding to represent speaker identity and achieve speaker-independent accent control through speaker disentanglement within the multi-scale accent modeling. Additionally, we present a local accent prediction model that enables our system to generate accented speech directly from phoneme inputs. We conduct extensive experiments on an English accented speech corpus. Experimental results demonstrate that our proposed system outperforms baseline systems in terms of speech quality and accent rendering for generating multi-speaker multi-accent speech. Ablation studies further validate the effectiveness of different components in our proposed system.
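The multi-scale idea in the abstract can be illustrated with a minimal sketch. It assumes the speaker embedding and the global (utterance level) accent embedding are single vectors broadcast over all phoneme positions, while the local (phoneme level) accent embedding has one vector per phoneme. Combining them by addition, the function name `combine_conditioning`, and all dimensions are assumptions made for illustration; this summary does not specify how the paper's acoustic model fuses these signals.

```python
import numpy as np


def combine_conditioning(phoneme_hidden, speaker_emb, global_accent_emb, local_accent_emb):
    """Fuse conditioning signals by addition (an illustrative assumption).

    phoneme_hidden:    (T, D) text-encoder outputs, one row per phoneme
    speaker_emb:       (D,)   one vector per utterance (speaker identity)
    global_accent_emb: (D,)   one vector per utterance (overall accent)
    local_accent_emb:  (T, D) one vector per phoneme (fine-grained accent)
    """
    T, D = phoneme_hidden.shape
    if speaker_emb.shape != (D,) or global_accent_emb.shape != (D,):
        raise ValueError("utterance-level embeddings must have shape (D,)")
    if local_accent_emb.shape != (T, D):
        raise ValueError("local accent embedding must have shape (T, D)")
    # Utterance-level vectors are broadcast across the T phoneme positions;
    # the phoneme-level accent embedding is added position by position.
    return (
        phoneme_hidden
        + speaker_emb[None, :]
        + global_accent_emb[None, :]
        + local_accent_emb
    )


# Hypothetical sizes: 5 phonemes, 8-dimensional hidden states.
rng = np.random.default_rng(0)
T, D = 5, 8
hidden = rng.normal(size=(T, D))
speaker = rng.normal(size=D)
global_accent = rng.normal(size=D)
local_accent = rng.normal(size=(T, D))

conditioned = combine_conditioning(hidden, speaker, global_accent, local_accent)
```

In this sketch, swapping `global_accent` shifts every phoneme position by the same offset, whereas editing one row of `local_accent` changes only that phoneme. That is the distinction between utterance-level and phoneme-level accent control that the abstract describes.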

Paper Structure (26 sections, 2 equations, 4 figures, 6 tables)

Introduction
Related Work
Expressive TTS
Accented TTS
Methodology
Acoustic Model (AM)
Speaker-Independent Global Accent Model (SIGAM)
Speaker-Independent Local Accent Model (SILAM)
Local Accent Prediction Model (LAPM)
Training Stages
Inference Stage
Experimental Setup
Database
Implementations
Evaluation Metrics
...and 11 more sections

Figures (4)

Figure 1: The architecture of the proposed multi-speaker multi-accent TTS framework in two training stages. The first stage is to train the acoustic model with the speaker-independent global and local accent models, and the second stage is to train the local accent prediction model. The speaker embedding $H_{S}$ is extracted from a pre-trained speaker encoder. Speech waveforms are generated by a pre-trained neural vocoder from the predicted Mel-spectrogram.
Figure 2: The architecture of (a) Decoder, (b) Global Accent Encoder, (c) Local Accent Encoder, (d) Local Accent Predictor. LN denotes layer normalization.
Figure 3: Visualizations of utterance level embeddings extracted from the ground truth accented speech by different models: (a) GST, (b) VAE, (c) SIGAM.
Figure 4: Visualizations of utterance level accent embeddings extracted from the ground truth accented speech by two models: GAM in the first row and SIGAM in the second row.
