Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

Kenichi Fujita; Atsushi Ando; Yusuke Ijima

TL;DR

Conventional speaker embeddings mostly model spectral features and F0. This paper instead introduces rhythm-based embeddings derived from phoneme sequences and their durations to capture speaking rhythm. The embedding extractor combines a bundle block, a Transformer encoder, and self-attentive pooling to produce a 32-dimensional embedding, and is trained with a speaker-identification objective on phoneme and duration inputs. Empirically, the rhythm-based embeddings achieve moderate speaker identification performance (EER around $15.2\%$ with large speaker sets), outperform spectral-feature-based embeddings in rhythm-related objective and subjective evaluations, and support robust duration prediction even with automatically estimated phoneme durations. Analyses of the embedding space show that the proposed embeddings track rhythm similarity more closely than x-vectors do, supporting rhythm-consistent speech synthesis from limited target-speaker data without retraining.
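To make the pooling step concrete, the following is a minimal sketch of self-attentive pooling in its generic form: each frame of a variable-length sequence gets a softmax weight, and the weighted sum yields one fixed-size vector. This is an illustration under stated assumptions, not the paper's attention block. The function name, the single `query` vector (which would normally be learned), and the toy dimensions are all hypothetical.

```python
import math


def self_attentive_pooling(frames, query):
    """Pool a list of frame vectors into one vector using softmax attention weights.

    `frames` is a list of equal-length lists of floats (one per phoneme or time step).
    `query` is a hypothetical scoring vector; in a trained model it would be learned.
    """
    # Score each frame by its dot product with the query vector.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]

    # Numerically stable softmax over the sequence axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The weighted sum has a fixed size regardless of sequence length.
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights
```

With an all-zero query every frame receives the same weight, so the pooling reduces to a plain average; a non-zero query lets some frames contribute more than others.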

Abstract

This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments to evaluate the performance: speaker embedding generation, speech synthesis with the generated embeddings, and embedding space analysis. The proposed method demonstrated moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with a rhythm closer to the target speaker's than the conventional method can. We also visualized the embeddings to evaluate the relationship between the distance between embeddings and perceptual similarity. The visualization of the embedding space and the analysis of how embedding closeness relates to similarity indicated that the distribution of embeddings reflects both subjective and objective similarity.
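For readers unfamiliar with the EER figure quoted above, here is a minimal sketch of how an equal error rate can be estimated from verification scores. It is a generic illustration, not the paper's evaluation code. The function name, the threshold sweep over observed scores, and the toy scores in the usage note are assumptions.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER: the operating point where false-accept and false-reject rates meet.

    `target_scores` are similarity scores for same-speaker trials and
    `nontarget_scores` are scores for different-speaker trials (higher = more similar).
    """
    # Sweep every observed score as a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

For example, `equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])` returns 0.25: at a threshold of 0.6, one of four different-speaker trials is accepted and one of four same-speaker trials is rejected. A lower EER means the embeddings separate speakers better.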

Paper Structure (27 sections, 4 equations, 15 figures, 2 tables)

Introduction
Related work
Multi-speaker speech synthesis
Embedding extraction and speech rhythm modeling
Speaker embedding method
x-vector
Proposed rhythm-based speaker embeddings
Input features
Bundle block
Transformer encoder block
Attention block
Experiments on speaker identification
Dataset
Model configurations
Evaluation of speaker identification performance
...and 12 more sections

Figures (15)

Figure 1: Comparison of conventional and proposed methods.
Figure 2: Bundle block.
Figure 3: Results of speaker identification.
Figure 4: Speaking rates.
Figure 5: Root mean square errors.
...and 10 more figures
