CosyAccent: Duration-Controllable Accent Normalization Using Source-Synthesis Training Data

Qibing Bai; Shuhao Shi; Shuai Wang; Yukai Ju; Yannan Wang; Haizhou Li

TL;DR

The paper tackles accent normalization when parallel L1-L2 data is scarce and synthesized targets risk introducing TTS artifacts. It introduces a source-synthesis data pipeline that generates L2 source speech from a large native L1 corpus and keeps the authentic L1 recordings as targets, so training requires no real L2 data and the model does not learn from TTS artifacts. Total output duration is controlled through a scaling ratio $R = \frac{L_{target}}{L_{source}}$, the ratio of target to source duration. The CosyAccent model, a non-autoregressive architecture with implicit rhythm modeling and explicit duration control, is trained without any real L2 speech and is reported to achieve better content preservation and naturalness than strong baselines trained on real L2 data. The approach is presented as scalable and well suited to dubbing and personalized TTS.
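The total-duration ratio $R = \frac{L_{target}}{L_{source}}$ can be illustrated with a minimal sketch. It assumes both lengths are measured in frames and that a requested output length is obtained by rounding; the function names `duration_ratio` and `target_length` are hypothetical and are not taken from the paper.

```python
def duration_ratio(source_len: int, target_len: int) -> float:
    """Return R = L_target / L_source (lengths assumed to be frame counts)."""
    if source_len <= 0:
        raise ValueError("source_len must be positive")
    return target_len / source_len


def target_length(source_len: int, ratio: float) -> int:
    """Return the output length implied by a ratio R.

    Rounding to the nearest frame and clamping to at least one frame
    are assumptions made for this sketch.
    """
    return max(1, round(source_len * ratio))


if __name__ == "__main__":
    r = duration_ratio(source_len=200, target_len=250)
    print(r, target_length(200, r))
```

In a dubbing setting one might fix the ratio at 1.0 to keep the source timing, while a predicted ratio would let the output length differ from the source.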

Abstract

Accent normalization (AN) systems often struggle with unnatural outputs and undesired content distortion, stemming from both suboptimal training data and rigid duration modeling. In this paper, we propose a "source-synthesis" methodology for training data construction. By generating source L2 speech and using authentic native speech as the training target, our approach avoids learning from TTS artifacts and, crucially, requires no real L2 data in training. Alongside this data strategy, we introduce CosyAccent, a non-autoregressive model that resolves the trade-off between prosodic naturalness and duration control. CosyAccent implicitly models rhythm for flexibility yet offers explicit control over total output duration. Experiments show that, despite being trained without any real L2 speech, CosyAccent achieves significantly improved content preservation and superior naturalness compared to strong baselines trained on real-world data.

Paper Structure (17 sections, 1 equation, 3 figures, 3 tables)

Introduction
Related-Work
Synthetic Data for Accent Conversion
Duration Modeling in Speech Conversion
Method
Construction of Training data
Model Architecture
Experimental Setup
Datasets
Compared Systems
Evaluation Data & Metrics
Results
Comparison with Frame-to-Frame Baseline
Comparison with Token-Based Baseline
Ablation Study
...and 2 more sections

Figures (3)

Figure 1: Construction pipeline of the paired training data.
Figure 2: CosyAccent architecture. It implicitly models rhythm for prosodic flexibility, while allowing the total duration to be either specified or predicted.
Figure 3: Speech decoder's alignment mechanism. Positional indices for the source content features are scaled to match the target's length, creating a coarse alignment within RoPE-based cross attention.
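The coarse alignment described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the source positional indices are multiplied by `target_len / source_len` so that they span the target's index range, and the helper names `scaled_source_positions` and `nearest_source_index` are hypothetical.

```python
def scaled_source_positions(source_len: int, target_len: int) -> list[float]:
    """Stretch source indices 0..source_len-1 onto the target's index range.

    Assumption: a simple linear scaling by target_len / source_len.
    """
    if source_len <= 0 or target_len <= 0:
        raise ValueError("lengths must be positive")
    scale = target_len / source_len
    return [i * scale for i in range(source_len)]


def nearest_source_index(target_pos: int, source_positions: list[float]) -> int:
    """Return the source index whose scaled position is closest to target_pos.

    With rotary embeddings, attention scores depend on relative position,
    so a small position gap marks the source frame a target frame is
    coarsely aligned to.
    """
    return min(
        range(len(source_positions)),
        key=lambda i: abs(source_positions[i] - target_pos),
    )


if __name__ == "__main__":
    positions = scaled_source_positions(source_len=4, target_len=8)
    print(positions)
    print(nearest_source_index(4, positions))
```

The scaling only provides a prior on where to attend; the cross attention remains free to deviate from it, which is consistent with the caption's description of the alignment as coarse.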
