Learning-free L2-Accented Speech Generation using Phonological Rules

Thanathai Lertpetchpun; Yoonjeong Lee; Jihwan Lee; Tiantian Feng; Dani Byrd; Shrikanth Narayanan

TL;DR

We propose an accented TTS framework that combines phonological rules with a multilingual TTS model to transform accent at the phoneme level while preserving intelligibility.

Abstract

Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained phoneme-level controllability. We propose an accented TTS framework that combines phonological rules with a multilingual TTS model. The rules are applied to phoneme sequences to transform accent at the phoneme level while preserving intelligibility. The method requires no accented training data and enables explicit phoneme-level accent manipulation. We design rule sets for Spanish- and Indian-accented English, modeling systematic differences in consonants, vowels, and syllable structure arising from phonotactic constraints. We analyze the trade-off between phoneme-level duration alignment and accent as realized in speech timing. Experimental results demonstrate effective accent shift while maintaining speech quality.
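
To make the idea of rule-based phoneme rewriting concrete, the following is a minimal sketch and not the paper's rule set. It assumes a word is given as a list of IPA-like phoneme symbols and applies two textbook patterns often described for Spanish-influenced English: a vowel inserted before a word-initial /s/ + consonant cluster, and context-free substitutions (/v/ to /b/, /z/ to /s/). The names `SUBSTITUTIONS`, `CLUSTER_CONSONANTS`, and `apply_rules` are hypothetical.

```python
# Illustrative only: these rules are textbook examples of Spanish-influenced
# English, NOT the rule sets designed in the paper.
SUBSTITUTIONS = {"v": "b", "z": "s"}            # context-free segment swaps (assumed)
CLUSTER_CONSONANTS = {"p", "t", "k", "m", "n", "l"}  # consonants that may follow initial /s/


def apply_rules(phonemes):
    """Rewrite one word, given as a list of phoneme symbols, and return a new list."""
    out = list(phonemes)
    # Syllable-structure rule: insert /e/ before a word-initial /s/ + consonant cluster.
    if len(out) >= 2 and out[0] == "s" and out[1] in CLUSTER_CONSONANTS:
        out.insert(0, "e")
    # Segmental rule: replace each phoneme that has a listed substitute.
    return [SUBSTITUTIONS.get(p, p) for p in out]


if __name__ == "__main__":
    print(apply_rules(["s", "k", "u", "l"]))  # "school"
    print(apply_rules(["v", "ɛ", "r", "i"]))  # "very"
```

In a pipeline like the one the abstract describes, the rewritten phoneme sequence, rather than the original, would then be passed to the multilingual TTS model; how the paper orders or conditions its rules is not specified here.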

Paper Structure (16 sections, 2 figures, 6 tables)

This paper contains 16 sections, 2 figures, 6 tables.

Introduction
Methods
Phonological Rules
Accented Speech Generation
Rhythmic Differences
Experiments
Datasets and Experimental Setup
Evaluation Metrics
Results and Discussion
Phonological Rules
Rhythmic Variation Effect
Effect of Each Phonological Rule
Subjective Evaluation
Conclusion
Acknowledgement
...and 1 more sections

Figures (2)

Figure 1: Synthesis Pipeline. We follow the synthesis and evaluation pipeline described in lertpetchpun2026quantifying. The key difference is that we condition the TTS model on a speaker embedding from the target language. In addition, we explicitly control whether duration alignment is applied between the American (US) phoneme sequence and the transformed target-accent phoneme sequence.
Figure 2: Human accent confusion matrix. Rows denote synthesis conditions (speaker embeddings with and without phonological rules), and columns represent listener-labeled accents.
