Pairwise Evaluation of Accent Similarity in Speech Synthesis

Jinzuomu Zhong, Suyuan Liu, Dan Wells, Korin Richmond

TL;DR

The paper addresses the evaluation of accent similarity in speech synthesis, an underexplored area where subjective and objective methods are applied inconsistently. On the subjective side, it refines the XAB listening test with transcriptions, highlighting of perceived accent differences, and listener screening, reaching higher statistical significance with fewer listeners. On the objective side, it proposes pronunciation-related metrics derived from vowel formant distances and DTW-aligned phonetic posteriorgrams (PPGs), and shows that these correlate strongly with subjective judgments, as do cosine similarity of AID/SV embeddings and MCD. The results expose limitations of WER, CER, and UTMOS for underrepresented accents and argue for a more inclusive evaluation framework for accent generation in ZS-TTS, accented TTS, and accent conversion systems.

Abstract

Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used to evaluate accent similarity. Moreover, our findings underscore significant limitations of common metrics like Word Error Rate in assessing underrepresented accents.
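
To make the idea of a distance between DTW-aligned phonetic posteriorgrams concrete, the following is a minimal sketch and not the authors' implementation. It assumes each PPG is a NumPy array of shape (frames, phone classes), uses Euclidean distance between frames, and normalises the accumulated cost by the length of the warping path; the frame distance, the normalisation, and the function names `frame_distances` and `dtw_ppg_distance` are all illustrative assumptions.

```python
import numpy as np


def frame_distances(ppg_a, ppg_b):
    """Euclidean distance between every frame of ppg_a and every frame of ppg_b.

    Assumption: the frame-level distance is Euclidean; the paper may use another.
    """
    diff = ppg_a[:, None, :] - ppg_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def dtw_ppg_distance(ppg_a, ppg_b):
    """Mean frame distance along a minimum-cost DTW path (hypothetical helper)."""
    cost = frame_distances(ppg_a, ppg_b)
    n, m = cost.shape
    # acc[i, j]: minimum accumulated cost of aligning the first i and j frames.
    acc = np.full((n + 1, m + 1), np.inf)
    # steps[i, j]: number of path steps taken on that minimum-cost alignment.
    steps = np.zeros((n + 1, m + 1), dtype=int)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (acc[i - 1, j - 1], steps[i - 1, j - 1]),  # advance both
                (acc[i - 1, j], steps[i - 1, j]),          # advance ppg_a only
                (acc[i, j - 1], steps[i, j - 1]),          # advance ppg_b only
            ]
            best_cost, best_steps = min(candidates)
            acc[i, j] = cost[i - 1, j - 1] + best_cost
            steps[i, j] = best_steps + 1
    # Normalise by path length so longer utterances are not penalised.
    return float(acc[n, m] / steps[n, m])


# Toy example: three one-hot "phone" frames and a time-stretched copy.
reference = np.eye(3)
stretched = np.repeat(reference, 2, axis=0)
reversed_order = reference[::-1]

same_content = dtw_ppg_distance(reference, stretched)
different_content = dtw_ppg_distance(reference, reversed_order)
```

In this toy case the time-stretched copy aligns at zero cost, while the reversed sequence does not, which is the property that lets a DTW-based distance ignore differences in speaking rate while remaining sensitive to differences in pronunciation.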

Paper Structure

This paper contains 17 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 2: Disparity of formant distribution between ground truth (shaded ellipses, dotted lines) and copysyn/xtts (hollow ellipses, solid lines) for speaker p252. Vowel symbols are ARPABET. F1/F2 axes are normalised for each speaker.
  • Figure : (a) AB preference% (mean±95%CI) of 15 valid submissions.