Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones

Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu

Abstract

Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.

Paper Structure

This paper contains 9 sections, 3 equations, 3 figures.

Figures (3)

  • Figure 1: Model-level ECAPA-TDNN cosine distances between original speech and voice clones for standard and accented Mandarin speakers. Top: mean original--cloned distance ($\overline{d}_{OC}$). Bottom: clone divergence ($\Delta_{\mathrm{div}}=\overline{d}_{OC}-\overline{d}_{OO}$); dashed line marks $\Delta_{\mathrm{div}}=0$. Error bars show 95% confidence intervals across speakers.
  • Figure 2: Listener-rated speaker similarity between each voice clone and its corresponding original recording for standard and accented Mandarin speakers, shown separately for different systems. Top: distributions of participant-level mean similarity ratings. Bottom: within-participant paired means for standard vs. accented.
  • Figure 3: Listener-rated intelligibility gain for each clone relative to its matched original recording, shown separately for each speaker set and each system. Top: distributions of participant-level mean gains. Bottom: within-participant paired mean gains for standard vs. accented.
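
The clone divergence in Figure 1, $\Delta_{\mathrm{div}}=\overline{d}_{OC}-\overline{d}_{OO}$, can be illustrated with a minimal sketch for a single speaker. Several things here are assumptions rather than details stated above: $\overline{d}_{OO}$ is read as the mean cosine distance between pairs of the speaker's own original recordings, $\overline{d}_{OC}$ is averaged over all original-clone pairs, and the two-dimensional vectors are toy stand-ins for ECAPA-TDNN embeddings. The function names are hypothetical.

```python
import math
from itertools import combinations, product
from statistics import mean


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def clone_divergence(originals, clones):
    """Return (d_OC, d_OO, delta_div) for one speaker.

    Assumed pairing scheme (not confirmed by the source):
      d_OC averages over every original-clone pair,
      d_OO averages over every unordered pair of originals.
    """
    d_oc = mean(cosine_distance(o, c) for o, c in product(originals, clones))
    d_oo = mean(cosine_distance(a, b) for a, b in combinations(originals, 2))
    return d_oc, d_oo, d_oc - d_oo


# Toy embeddings standing in for real speaker embeddings (illustrative only).
originals = [[1.0, 0.0], [0.0, 1.0]]
clones = [[1.0, 0.0]]

d_oc, d_oo, delta_div = clone_divergence(originals, clones)
print(f"d_OC={d_oc:.3f}  d_OO={d_oo:.3f}  delta_div={delta_div:.3f}")
```

Under this reading, a positive $\Delta_{\mathrm{div}}$ means the clones sit farther from the originals than the originals sit from one another, and a value near the dashed zero line means the clones fall within the speaker's natural recording-to-recording variation.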