MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion

Sho Inoue, Shuai Wang, Wanxing Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li

TL;DR

MacST tackles the data scarcity barrier in accent conversion by introducing a transliteration-based pipeline that generates parallel accent data. By leveraging LLM-driven transliteration and multilingual TTS, it creates accented English samples from arbitrary input text, enabling scalable, speaker- and language-flexible data generation. The method yields a synthetic parallel corpus that, when used to train a conversion model, improves accentedness, naturalness, and speaker preservation, as shown by subjective MUSHRA scores and objective metrics such as WER, AECS, and SECS. This approach offers a practical path to data augmentation for accent conversion, with demonstrated benefits for native and non-native English accents and potential for broader linguistic coverage.

Abstract

In accented voice conversion, or accent conversion, we seek to convert the accent of speech from one to another while preserving speaker identity and semantic content. In this study, we formulate a novel method for creating multi-accented speech samples, that is, pairs of accented speech samples by the same speaker, through text transliteration for training accent conversion systems. We begin by generating transliterated text with Large Language Models (LLMs), which is then fed into multilingual TTS models to synthesize accented English speech. As a reference system, we built a sequence-to-sequence accent conversion model on the synthetic parallel corpus. We validated the proposed method for both native and non-native English speakers. Subjective and objective evaluations further validate our dataset's effectiveness in accent conversion studies.
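
The objective metrics named in the TL;DR can be illustrated with a minimal sketch. It assumes that WER is the usual word-level edit distance divided by the reference length, and that AECS and SECS are cosine similarities between accent and speaker embeddings respectively (a common reading of these acronyms, not confirmed by the text here). The function names are hypothetical, and the embeddings would come from separate pretrained encoders that are not shown.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length.

    Assumption: AECS and SECS apply this to accent and speaker embeddings.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

In this reading, a lower WER indicates better preserved content, while a higher similarity indicates a closer match to the target accent (AECS) or to the source speaker (SECS).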

Paper Structure (18 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Overall diagram of the MacST pipeline. The system first generates transliterated text from the input, which is then fed into multilingual TTS models to synthesize accented speech. Red text denotes the input data of the proposed system (English Text, Target Language, and Speaker Information).
  • Figure 2: Examples of the transliteration process: the LLM prompt and the expected response. Example responses are placed in [Few Shot Examples].
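
The two-stage flow described in the figure captions (few-shot LLM transliteration, then multilingual TTS) can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the prompt wording, the names `build_prompt` and `synthesize_accented`, and the Japanese few-shot pairs are all hypothetical, and the LLM and TTS back ends are passed in as plain callables so that no particular library API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class FewShotExample:
    english: str
    transliteration: str


def build_prompt(text: str, target_language: str,
                 examples: Sequence[FewShotExample]) -> str:
    """Assemble a few-shot transliteration prompt (wording is hypothetical)."""
    lines = [
        f"Transliterate the English text into {target_language} script, "
        "keeping the English pronunciation as closely as possible.",
        "",
        "[Few Shot Examples]",
    ]
    for example in examples:
        lines.append(f"English: {example.english}")
        lines.append(f"Transliteration: {example.transliteration}")
    lines += ["", f"English: {text}", "Transliteration:"]
    return "\n".join(lines)


def synthesize_accented(text: str, target_language: str, speaker: str,
                        examples: Sequence[FewShotExample],
                        llm: Callable[[str], str],
                        tts: Callable[[str, str, str], bytes]) -> bytes:
    """Run the three inputs (text, target language, speaker) through both stages."""
    prompt = build_prompt(text, target_language, examples)
    transliterated = llm(prompt).strip()
    if not transliterated:
        raise ValueError("LLM returned an empty transliteration")
    # The TTS model for the target language reads the transliterated text,
    # which is what gives the English utterance its accent.
    return tts(transliterated, target_language, speaker)


# Illustrative few-shot pairs; these are not taken from the paper.
JA_EXAMPLES = [
    FewShotExample("hello", "ハロー"),
    FewShotExample("thank you", "サンキュー"),
]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns a fixed transliteration."""
    return " グッド モーニング \n"


def fake_tts(text: str, language: str, speaker: str) -> bytes:
    """Stand-in for a real TTS call; returns a tagged byte string, not audio."""
    return f"{speaker}|{language}|{text}".encode("utf-8")


audio = synthesize_accented("good morning", "Japanese", "speaker_01",
                            JA_EXAMPLES, fake_llm, fake_tts)
```

Running the same English text and speaker through two different target languages would yield a same-speaker pair of accented utterances, which is the kind of parallel data the abstract describes.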