Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones

Tianle Yang; Chengzhe Sun; Phil Rose; Siwei Lyu

Abstract

Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.
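
The abstract does not state which distance the embedding-based analyses use. The sketch below is a minimal illustration that assumes cosine distance between fixed-length speaker-embedding vectors that have already been extracted. The function names `cosine_distance`, `group_mean_distance` and `accent_gap` are hypothetical. The sketch is not the authors' pipeline. It only makes concrete what an "original-clone distance" and an accented-standard difference in that distance could mean. It computes a raw gap in group means and does not run the statistical test that a claim of "no reliable difference" would require.

```python
import math
from statistics import mean


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length embedding vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def group_mean_distance(pairs):
    """Mean original-clone distance over a list of (original, clone) embedding pairs."""
    return mean(cosine_distance(orig, clone) for orig, clone in pairs)


def accent_gap(standard_pairs, accented_pairs):
    """Accented minus standard mean distance; a positive value means accented clones
    sit further from their originals in this (hypothetical) embedding space."""
    return group_mean_distance(accented_pairs) - group_mean_distance(standard_pairs)


if __name__ == "__main__":
    # Toy 2-D "embeddings" for illustration only; real speaker embeddings are high-dimensional.
    standard = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
    accented = [([1.0, 0.0], [0.0, 1.0])]
    print(accent_gap(standard, accented))
```

On these toy vectors each standard clone matches its original exactly (distance 0.0). The accented clone is orthogonal to its original (distance 1.0), so the gap is 1.0. In a real analysis the per-pair distances would be compared across systems with a significance test rather than by a single difference in means.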
