CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation

Ji-Hoon Kim; Hong-Sun Yang; Yoon-Cheol Ju; Il-Hwan Kim; Byeong-Yeol Kim; Joon Son Chung

TL;DR

CrossSpeech++ addresses the persistent challenge of language-speaker entanglement in cross-lingual TTS by decoupling language- and speaker-related generation into two dedicated acoustic generators. The Language-dependent Generator (LDG) and the Speaker-dependent Generator (SDG) produce complementary representations. The LDG relies on Mix-dynamic Speaker Layer Normalization (MDSLN), a Language-dependent Variance (LDV) adaptor, and a linguistic adaptor to model prosody without speaker bias, while the SDG uses Dynamic Speaker Layer Normalization (DSLN) and a Speaker-dependent Variance (SDV) adaptor to model timbre. Experiments across four languages show substantial improvements in cross-lingual naturalness and intelligibility (MOS, UTMOS, CER) while maintaining competitive speaker similarity, supported by ablations and comparisons with zero-shot baselines. The work advances practical multi-language synthesis by disentangling language and speaker information in the output space and by leveraging SSL-based linguistic features. It also acknowledges the data requirements for low-resource languages and notes potential misuse risks.

Abstract

The goal of this work is to generate natural speech in multiple languages while maintaining the same speaker identity, a task known as cross-lingual speech synthesis. A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems. In this paper, we propose CrossSpeech++, which effectively disentangles language and speaker information and significantly improves the quality of cross-lingual speech synthesis. To this end, we break the complex speech generation pipeline into two simple components: language-dependent and speaker-dependent generators. The language-dependent generator produces linguistic variations that are not biased by specific speaker attributes. The speaker-dependent generator models acoustic variations that characterize speaker identity. By handling each type of information in separate modules, our method can effectively disentangle language and speaker representation. We conduct extensive experiments using various metrics, and demonstrate that CrossSpeech++ achieves significant improvements in cross-lingual speech synthesis, outperforming existing methods by a large margin.
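The two-generator decomposition described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the hash-like feature formula, and the random speaker offset are all hypothetical stand-ins chosen only to show the structure, in which the language-dependent output ignores the speaker and the final mel-spectrogram is the sum of both outputs (as stated in the Figure 1 caption).

```python
import random

def language_dependent_generator(text_ids, n_mels=4):
    """Toy stand-in: features depend only on the text, never on the speaker."""
    return [[((t * 31 + m * 7) % 11) / 10.0 for m in range(n_mels)]
            for t in text_ids]

def speaker_dependent_generator(ld_features, speaker_id):
    """Toy stand-in: a speaker-specific offset shaped like the LD features."""
    rng = random.Random(speaker_id)  # hypothetical: seed by speaker id
    offset = [rng.uniform(-0.5, 0.5) for _ in ld_features[0]]
    return [list(offset) for _ in ld_features]

def synthesize_mel(text_ids, speaker_id):
    """Sum the two representations to obtain a toy 'mel-spectrogram'."""
    ld = language_dependent_generator(text_ids)
    sd = speaker_dependent_generator(ld, speaker_id)
    mel = [[a + b for a, b in zip(ld_row, sd_row)]
           for ld_row, sd_row in zip(ld, sd)]
    return ld, sd, mel

# Same text, two different (hypothetical) speaker ids.
ld_a, sd_a, mel_a = synthesize_mel([3, 1, 4], speaker_id=0)
ld_b, sd_b, mel_b = synthesize_mel([3, 1, 4], speaker_id=1)
```

In this sketch `ld_a` and `ld_b` are identical because the speaker never enters the language-dependent path, while `sd_a` and `sd_b` differ. The real generators are conformer-based networks, which this toy does not attempt to reproduce.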

Paper Structure (27 sections, 11 equations, 6 figures, 5 tables)

Introduction
Related Works
Speech Synthesis
Cross-lingual Speech Synthesis
Domain Generalization
Model Architecture
Language-dependent Generator
Speaker Generalization
Language-dependent Variance Adaptor
Pitch
Energy
Linguistic Adaptor
Speaker-dependent Generator
Experimental Settings
Dataset
...and 12 more sections

Figures (6)

Figure 1: CrossSpeech++ operates as follows: From text inputs, the language-dependent generator produces language-dependent representations that capture linguistic characteristics derived solely from the text. The subsequent speaker-dependent generator then colorizes the output with speaker-specific attributes, and the mel-spectrogram is produced by summing the representations from both generators. The output mel-spectrogram is then converted to an audible waveform by a pre-trained neural vocoder.
Figure 2: The overall architecture of CrossSpeech++. ${\bf e}_\text{l}$ and ${\bf e}_\text{s}$ denote the language and speaker embeddings, which are derived from trainable lookup tables. Detailed architectures of the Language-dependent (LD) conformer encoder and the Speaker-dependent (SD) conformer encoder are depicted in (b) and (c), respectively. MHA denotes multi-head attention. We replace the final Layer Normalization (LN) in the conformer with Mix-dynamic Speaker Layer Normalization (MDSLN) in the LD encoder and Dynamic Speaker Layer Normalization (DSLN) in the SD encoder.
Figure 3: Batch-wise shuffle operation. ${\bf e}_\text{s}$ denotes the speaker embeddings and $\Tilde{{\bf e}}_\text{s}$ denotes the shuffled speaker embeddings.
Figure 4: A pipeline for extracting the target linguistic features from a waveform. To eliminate speaker-related features, the perturbed waveform is input to a multilingual wav2vec2.0. The following linguistic encoder extracts solely linguistic features, which are fed into a text predictor.
Figure 5: Visualization of language-dependent (LD) and speaker-dependent (SD) features. We visualize LD and SD features for two languages (en-US and ko-KR), each spoken by four different speakers: English (EN), Korean (KR), Chinese (CN), and Japanese (JP). Note that the LD feature remains invariant, while the SD feature varies across different speakers.
...and 1 more figures
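
The batch-wise shuffle in the Figure 3 caption can be sketched as a random permutation of the speaker embeddings along the batch dimension. The sketch below is a minimal illustration under that assumption. The function name `batch_shuffle` and the `seed` parameter are hypothetical, and how the shuffled embeddings are consumed downstream (for example inside MDSLN) is not specified on this page, so it is not modeled here.

```python
import random

def batch_shuffle(speaker_embeddings, seed=None):
    """Permute speaker embeddings along the batch dimension.

    Returns the shuffled embeddings and the permutation used, so that
    shuffled[k] is the embedding originally at index perm[k].
    """
    rng = random.Random(seed)
    perm = list(range(len(speaker_embeddings)))
    rng.shuffle(perm)
    shuffled = [speaker_embeddings[i] for i in perm]
    return shuffled, perm

# A toy batch of four 2-dimensional speaker embeddings e_s.
e_s = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
e_s_tilde, perm = batch_shuffle(e_s, seed=0)
```

The shuffle only reorders which embedding is paired with which item in the batch. The set of embeddings is unchanged, and the input list is left untouched.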
