Disentangling segmental and prosodic factors to non-native speech comprehensibility

Waris Quamer, Ricardo Gutierrez-Osuna

TL;DR

The paper presents an accent conversion framework that disentangles segmental and prosodic information, enabling independent manipulation of voice quality, segmentals, and prosody for non-native speech. By discretizing phonetic content with vector quantization and removing consecutive duplicates, the model forces reliance on a prosody embedding from a reference utterance, improving prosody transfer while preserving speaker similarity. Perceptual studies show segmental cues have a larger impact on comprehensibility than prosody, challenging some prior assumptions and highlighting the importance of accurate segmental articulation in non-native speech. The approach offers a tool for evaluating social attitudes toward accents and has potential applications in computer-assisted pronunciation training and targeted interventions to reduce bias.

Abstract

Current accent conversion (AC) systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics. Being able to manipulate a non-native speaker's segmental and/or prosodic channels independently is critical for quantifying how these two channels contribute to speech comprehensibility and social attitudes. We present an AC system that not only decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics. The system is able to generate accent conversions that combine (1) the segmental characteristics from a source utterance, (2) the voice characteristics from a target utterance, and (3) the prosody of a reference utterance. We show that vector quantization of acoustic embeddings and removal of consecutive duplicated codewords allow the system to transfer prosody and improve voice similarity. We conduct perceptual listening tests to quantify the individual contributions of segmental features and prosody to the perceived comprehensibility of non-native speech. Our results indicate that, contrary to prior research in non-native speech, segmental features have a larger impact on comprehensibility than prosody. The proposed AC system may also be used to study how segmental and prosodic cues affect social attitudes towards non-native speech.
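
The two operations named in the abstract, vector quantization of frame-level embeddings and removal of consecutive duplicated codewords, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `quantize` and `collapse_repeats`, the toy 2-dimensional codebook, and the nearest-neighbour (squared Euclidean) assignment are all assumptions made for illustration.

```python
from itertools import groupby


def quantize(frames, codebook):
    """Map each frame to the index of its nearest codeword.

    Assumption: nearest neighbour under squared Euclidean distance;
    the paper's quantizer may differ.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def collapse_repeats(indices):
    """Drop consecutive duplicates, keeping one index per run."""
    return [index for index, _ in groupby(indices)]


# Hypothetical toy data: 3 codewords and 6 frames of 2-dimensional embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [1.1, 0.8], [3.8, 0.1], [0.0, 0.1]]

codes = quantize(frames, codebook)   # [0, 0, 1, 1, 2, 0]
collapsed = collapse_repeats(codes)  # [0, 1, 2, 0]
```

Collapsing runs discards how long each codeword lasted, so the discrete sequence no longer carries duration. In the described system, this is what pushes the model to take timing and other prosodic cues from the reference utterance's prosody embedding instead.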

Paper Structure (22 sections, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Block diagram of the proposed system. The prosody encoder and seq2seq model are trained jointly as an auto-encoder. For accent conversion, segmentals come from U1 and prosody from U3, thus providing independent control of both channels.
  • Figure 2: Mel Cepstral Distortion (MCD) vs. codebook size. The lowest MCD is reached when the number of codewords is infinite (baseline system). MCD decreases as the number of codewords increases, and stabilizes after 128 codewords.
  • Figure 3: t-SNE of speaker embeddings for source (black), target (blue), baseline $vq\infty$ (red) and proposed $vq128$ (green). The arrows represent a path connecting source and target utterances, passing through conversions from the two systems. Conversions from $vq128$ are much closer to the target than those from $vq\infty$, indicating that the $vq128$ system provides better transfer of speaker identity.
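
For readers unfamiliar with the metric in Figure 2, the sketch below computes Mel Cepstral Distortion using its commonly used definition: per frame, (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²), averaged over frames. The listing does not state the paper's exact setup, so the function name `mcd_db`, the exclusion of the 0th (energy) coefficient, and the requirement that the two sequences are already time-aligned are assumptions.

```python
import math


def mcd_db(reference, converted):
    """Mean Mel Cepstral Distortion in dB between two aligned sequences.

    Each argument is a list of frames; each frame is a list of mel-cepstral
    coefficients. Assumptions: frames are already time-aligned (e.g. by
    dynamic time warping) and coefficient 0 (energy) is excluded.
    """
    if len(reference) != len(converted) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")

    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref_frame, conv_frame in zip(reference, converted):
        # Skip index 0 so that loudness differences do not dominate.
        squared = sum((r - c) ** 2 for r, c in zip(ref_frame[1:], conv_frame[1:]))
        total += math.sqrt(squared)
    return scale * total / len(reference)


# Hypothetical single-frame example: only coefficient 1 differs, by 1.0.
example = mcd_db([[0.0, 0.0, 0.0]], [[5.0, 1.0, 0.0]])
```

A lower value means the converted cepstra are closer to the reference.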