AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun
TL;DR
The paper tackles poor accent fidelity in zero-shot text-to-speech (ZS-TTS) by introducing zero-shot accent generation, a task that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS. It proposes a two-stage framework: GenAID, an accent identification (AID) model that generalises across speakers and disentangles accent from speaker identity, and AccentBox, a zero-shot accent generator conditioned on GenAID embeddings. GenAID achieves a state-of-the-art F1 score of 0.56 on unseen speakers across 13 accents, while AccentBox attains high accent fidelity in inherent and cross-accent generation and enables generation of unseen accents. This approach improves accent control, reduces speaker-accent entanglement, and broadens practical applications such as Computer-Assisted Pronunciation Training (CAPT) and multilingual dubbing, with future work extending to L2 accents.
Abstract
While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) results on Accent Identification (AID), with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent/cross accent generation, and enables unseen accent generation.
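To make the "F1 score on unseen speakers" figure concrete, the following is a minimal sketch of how such an evaluation could be set up: utterances are split so that test speakers never occur in training, and F1 is then computed per accent and averaged. The record layout, the helper names `speaker_disjoint_split` and `macro_f1`, the toy labels, and the choice of macro averaging are all assumptions made for illustration; the abstract does not state which F1 variant is reported.

```python
def speaker_disjoint_split(utterances, held_out_speakers):
    """Split records so that held-out speakers appear only in the test set.

    `utterances` is a list of dicts with "speaker" and "accent" keys
    (a hypothetical layout chosen for this sketch).
    """
    held_out = set(held_out_speakers)
    train = [u for u in utterances if u["speaker"] not in held_out]
    test = [u for u in utterances if u["speaker"] in held_out]
    return train, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: speakers s3 and s4 are held out, so they are "unseen" at test time.
utterances = [
    {"speaker": "s1", "accent": "irish"},
    {"speaker": "s2", "accent": "scottish"},
    {"speaker": "s3", "accent": "irish"},
    {"speaker": "s3", "accent": "irish"},
    {"speaker": "s4", "accent": "scottish"},
    {"speaker": "s4", "accent": "american"},
]
train, test = speaker_disjoint_split(utterances, held_out_speakers=["s3", "s4"])

y_true = [u["accent"] for u in test]
# Hypothetical classifier outputs for the four test utterances.
y_pred = ["irish", "scottish", "scottish", "american"]
score = macro_f1(y_true, y_pred)
```

Averaging per-class scores without weighting gives every accent the same influence regardless of how many utterances it has, which is one reasonable choice when accent classes are imbalanced.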
