Controllable Accent Normalization via Discrete Diffusion

Qibing Bai; Yuhan Du; Tom Ko; Shuai Wang; Yannan Wang; Haizhou Li

Abstract

Existing accent normalization methods do not typically offer control over accent strength, yet many applications, such as language learning and dubbing, require tunable accent retention. We propose DLM-AN, a controllable accent normalization system built on masked discrete diffusion over self-supervised speech tokens. A Common Token Predictor identifies source tokens that likely encode native pronunciation; these tokens are selectively reused to initialize the reverse diffusion process. This provides a simple yet effective mechanism for controlling accent strength: reusing more tokens preserves more of the original accent. DLM-AN further incorporates a flow-matching Duration Ratio Predictor that automatically adjusts the total duration to better match the native rhythm. Experiments on multi-accent English data show that DLM-AN achieves the lowest word error rate among all compared systems while delivering competitive accent reduction and smooth, interpretable accent strength control.
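
The token-reuse mechanism described in the abstract can be illustrated with a minimal sketch. It assumes that the reverse diffusion process starts from a sequence in which reused source tokens are kept and every other position holds a mask token, and that reuse is controlled by a proportion of top-confidence tokens. The names `MASK`, `init_reverse_diffusion`, and `reuse_ratio` are hypothetical and not taken from the paper, and the sketch ignores the duration adjustment, so source and target lengths are assumed equal.

```python
# Hypothetical id for a masked (to-be-generated) position; not from the paper.
MASK = -1


def init_reverse_diffusion(source_tokens, ctp_confidence, reuse_ratio):
    """Keep the top `reuse_ratio` fraction of source tokens by confidence.

    All other positions are replaced by MASK so that a masked-diffusion
    decoder would have to regenerate them. Assumes equal source/target length.
    """
    if len(source_tokens) != len(ctp_confidence):
        raise ValueError("tokens and confidences must have the same length")
    if not 0.0 <= reuse_ratio <= 1.0:
        raise ValueError("reuse_ratio must lie in [0, 1]")

    n_keep = int(reuse_ratio * len(source_tokens))
    # Rank positions by descending confidence; ties are broken by position.
    ranked = sorted(range(len(source_tokens)),
                    key=lambda i: (-ctp_confidence[i], i))
    keep = set(ranked[:n_keep])
    return [tok if i in keep else MASK for i, tok in enumerate(source_tokens)]


tokens = [11, 22, 33, 44, 55, 66]
confidence = [0.9, 0.1, 0.8, 0.3, 0.6, 0.2]
half_reused = init_reverse_diffusion(tokens, confidence, 0.5)
```

Under these assumptions, `reuse_ratio = 0.0` masks every position (full regeneration), while larger values keep more source tokens and therefore more of the original accent.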

Paper Structure (29 sections, 13 equations, 7 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Controllability in Accent Conversion
Self-Supervised Speech Tokens
Discrete Diffusion
Methodology
Discrete Diffusion for Speech Tokens
Common Token Prediction
Duration Ratio Prediction
Phoneme Guidance and Joint Training
Sampling Algorithm for Token Conversion
Token-to-Speech Synthesis
Experimental Setup
Datasets
Tokenizer
...and 14 more sections

Figures (7)

Figure 1: Overview of the DLM-AN pipeline. The SSL tokenizer extracts discrete tokens from L2-accented speech. A Transformer token encoder with CTC-based phonemic guidance produces content representations, which are fed into the Common Token Predictor (CTP), Duration Ratio Predictor (DP), and the DLM decoder. The DLM decoder iteratively generates the target token sequence, optionally initialized with high-CTP-confidence source tokens. A flow-matching synthesizer and vocoder produce the final waveform.
Figure 2: Structure of the DLM decoder. The input consists of the content features and the noised target tokens, separated by special tokens [START], [TASK], and [END]. The content features are mutually attentive but do not attend to the token sequence (pink region), while the token sequence attends to the entire input (green region).
Figure 3: Extraction of common token labels via the longest common subsequence (LCS) between source and target token sequences. For consecutive identical tokens with differing durations, center-mode alignment is applied (dashed rectangle).
Figure 4: Visualization of common token prediction for a Chinese-accented sample. CTP confidence values are overlaid on the Mel-spectrogram. PPG-predicted phonemes are shown below and their boundaries (white dashed lines) are overlaid on the spectrogram. Aligned words are shown at the bottom. Regions with prominent L2 accent (e.g., prolonged "a", unclear "had", /S/-like ending of "death") receive low CTP confidence.
Figure 5: CTP-based vs. random token selection at varying reuse proportions. Three metrics are compared: (a) WER, (b) $\Delta$PPG with the L1-accented target, and (c) $\Delta$PPG with the L2-accented source. CTP-based selection achieves generally lower WER and consistently better accent separation than random selection at the same proportion.
...and 2 more figures
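
The label extraction summarized in the Figure 3 caption can be sketched as follows. This is a plain longest-common-subsequence (LCS) computation over two token sequences that marks each source position as common (1) or not (0). The function name `common_token_labels` is hypothetical, and the center-mode alignment for runs of identical tokens is not reproduced here because the caption does not specify it.

```python
def common_token_labels(source, target):
    """Label source positions that belong to one LCS of source and target."""
    n, m = len(source), len(target)
    # dp[i][j] holds the LCS length of source[i:] and target[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if source[i] == target[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk forward through the table, marking matched source positions.
    labels = [0] * n
    i = j = 0
    while i < n and j < m:
        if source[i] == target[j]:
            labels[i] = 1
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return labels


labels = common_token_labels([5, 7, 7, 9, 3], [5, 7, 9, 4, 3])
```

When several subsequences have the same maximal length, this sketch returns just one of them; how the paper resolves such ties beyond center-mode alignment is not stated in the caption.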
