Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data

Thanathai Lertpetchpun; Thanapat Trachu; Jihwan Lee; Tiantian Feng; Dani Byrd; Shrikanth Narayanan

TL;DR

Accent Vector enables fine-grained and compositional accent control, as confirmed by objective and human evaluations, and it generalizes beyond English to accent control across multiple languages.

Abstract

Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current text-to-speech (TTS) systems primarily model American-accented English due to limited accented data. We propose Accent Vector, a controllable representation that enables accent manipulation in multilingual TTS without requiring accented training data. Accent Vector is derived by fine-tuning a TTS system on native speech of a different language (i.e., non-English) and computing task vectors that capture accent characteristics (i.e., in English). By scaling and interpolating the vector, we achieve fine-grained control over accent strength and generate mixed-accent speech. The approach also generalizes beyond English, enabling accent control across multiple languages. Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
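
The derivation and scaling described in the abstract can be illustrated with a minimal sketch. It assumes the standard task-vector definition (fine-tuned weights minus pretrained weights) and uses a toy dictionary of floats in place of real TTS parameters; the names `accent_vector`, `apply_accent`, and `strength` are hypothetical and do not come from the paper, whose exact formulation is not shown in this excerpt.

```python
from typing import Dict

# Toy stand-in for a model's parameters (assumption: real models hold tensors).
Params = Dict[str, float]


def accent_vector(pretrained: Params, finetuned: Params) -> Params:
    """Task vector: element-wise difference between fine-tuned and pretrained weights."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}


def apply_accent(pretrained: Params, vector: Params, strength: float) -> Params:
    """Add the scaled task vector to the pretrained weights.

    strength = 0.0 recovers the pretrained model; strength = 1.0 recovers
    the fine-tuned model; values in between give an intermediate model.
    """
    return {name: pretrained[name] + strength * vector[name] for name in pretrained}


# Hypothetical two-parameter "model" before and after fine-tuning.
pretrained = {"w1": 0.5, "w2": -1.0}
finetuned = {"w1": 1.0, "w2": -2.0}

tau = accent_vector(pretrained, finetuned)
half_strength = apply_accent(pretrained, tau, 0.5)
print(tau)            # {'w1': 0.5, 'w2': -1.0}
print(half_strength)  # {'w1': 0.75, 'w2': -1.5}
```

In this toy setting the coefficient plays the role of the accent-strength control: the caption of Figure 3 reports that larger coefficients yield a stronger accent at the cost of higher WER.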

Paper Structure (38 sections, 5 equations, 5 figures, 6 tables)

Introduction
Background and Related work
Task Vector
XTTS
Accented TTS
Method
Fine-tuning Procedure
Obtaining Accent Vector
Accent Vector Arithmetic
Inference Procedure
Experimental Setup
Text-to-speech Modeling
Datasets
Evaluation Metrics
Objective Accent Evaluation
...and 23 more sections

Figures (5)

Figure 1: Accent Vector Framework. The top panel illustrates the fine-tuning procedure, and the bottom panel shows the inference process for generating accented speech. During inference, a language ID token (e.g., [en]) is concatenated with the transcript and provided as input to the model.
Figure 2: Accent Vector Arithmetic. The figure illustrates the computation of the Accent Vector, its scaling for accent strength control, and the interpolation of multiple Accent Vectors for mixed-accent synthesis.
Figure 3: Control of accent strength using the task vector coefficient. We evaluate the model fine-tuned on British and Hindi by measuring accent probability and word error rate (WER) across different task vector coefficients. The first and second rows correspond to the results of British- and Hindi-accented English, respectively. A clear trade-off between accent strength and intelligibility is observed; increasing the coefficient produces a stronger accent but results in higher WER.
Figure 4: Effect of the task vector coefficient when mixing multiple Accent Vectors. We combine Spanish and English Accent Vectors using coefficients $\alpha$ and $1-\alpha$, respectively. Accent probabilities for Spanish and English accents are evaluated across different values of $\alpha$. The vertical dashed line indicates the accent probability of the pretrained model of each accent.
Figure 5: Confusion matrix of target accents versus the accents perceived by human listeners.
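
The mixing scheme named in the Figure 4 caption, which weights two Accent Vectors by $\alpha$ and $1-\alpha$, can be sketched as a convex combination. This is an illustration under stated assumptions only: parameters are toy floats rather than model tensors, and the function name `mix_accent_vectors` and the example values are hypothetical, not taken from the paper.

```python
from typing import Dict

# Toy stand-in for a task vector over model parameters (assumption).
Params = Dict[str, float]


def mix_accent_vectors(vec_a: Params, vec_b: Params, alpha: float) -> Params:
    """Interpolate two accent vectors with weights alpha and 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {name: alpha * vec_a[name] + (1.0 - alpha) * vec_b[name] for name in vec_a}


# Hypothetical vectors standing in for the Spanish and English Accent Vectors.
spanish_vec = {"w1": 1.0, "w2": 0.0}
english_vec = {"w1": 0.0, "w2": 2.0}

mixed = mix_accent_vectors(spanish_vec, english_vec, alpha=0.25)
print(mixed)  # {'w1': 0.25, 'w2': 1.5}
```

Setting `alpha` to 1.0 or 0.0 returns the first or second vector unchanged, so sweeping `alpha` moves between the two single-accent endpoints; the mixed vector would then be added to the pretrained weights in the same way as a single Accent Vector.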
