AccentBox: Towards High-Fidelity Zero-Shot Accent Generation

Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun

TL;DR

The paper tackles poor accent fidelity in zero-shot TTS by introducing zero-shot accent generation, a task that unifies foreign accent conversion (FAC), accented TTS, and ZS-TTS. It proposes a two-stage framework: GenAID, an accent identification model that generalises across speakers by disentangling accent from speaker identity, and AccentBox, a zero-shot accent generator conditioned on GenAID embeddings. GenAID achieves a state-of-the-art F1 score of 0.56 on unseen speakers across 13 accents, while AccentBox attains high accent fidelity in inherent and cross-accent generation and enables generation of unseen accents. This approach improves accent control, reduces speaker-accent entanglement, and broadens practical applications such as computer-assisted pronunciation training (CAPT) and multilingual dubbing, with future work extending to L2 accents.

Abstract

While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) performance on Accent Identification (AID), with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent and cross-accent generation, and enables unseen accent generation.

Paper Structure

This paper contains 31 sections, 1 equation, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Model architecture of GenAID.
  • Figure 2: Model architecture of AccentBox. The pretrained accent encoder (GenAID) is the same as in Fig. 1.
  • Figure 3: t-SNE visualisation of embeddings from different AID systems on unseen speakers. (Each color represents an accent.)