
A Pilot Study of Applying Sequence-to-Sequence Voice Conversion to Evaluate the Intelligibility of L2 Speech Using a Native Speaker's Shadowings

Haopeng Geng, Daisuke Saito, Nobuaki Minematsu

TL;DR

Inspired by language teachers who correct students’ pronunciation through a voice-to-voice process, this pilot study uses a unique semi-parallel dataset composed of non-native (L2) speakers’ read-aloud utterances, native (L1) speakers’ shadowings of that speech, and the L1 speakers’ script-shadowing utterances to create a virtual shadower system.

Abstract

Utterances by L2 speakers can be unintelligible due to mispronunciation and improper prosody. In computer-aided language learning systems, textual feedback is often provided using a speech recognition engine. However, an ideal form of feedback for L2 speakers should be fine-grained enough to let them detect and diagnose the unintelligible parts of their own utterances. Inspired by language teachers who correct students' pronunciation through a voice-to-voice process, this pilot study uses a unique semi-parallel dataset composed of non-native (L2) speakers' read-aloud utterances, native (L1) speakers' shadowings of that speech, and the L1 speakers' script-shadowing utterances. We explore the technical possibility of replicating the process of an L1 speaker shadowing L2 speech using voice conversion (VC) techniques, to create a virtual shadower system. Experimental results demonstrate the feasibility of the VC system in simulating an L1 speaker's shadowing behavior. The output of the virtual shadower system shows reasonable similarity to the real L1 shadowing utterances in both linguistic and acoustic aspects.

Paper Structure (23 sections, 2 equations, 3 figures, 2 tables)


Figures (3)

  • Figure 1: The concept of the virtual shadower, which simulates the shadowing behaviors of an L1 rater hearing a given L2 speech for the first time. Stuttering or inarticulate production of speech may occur due to listening disfluencies.
  • Figure 2: The proposed shadowing technique aims to identify unintelligible parts in an L2 speaker's reading aloud utterances ($L2_{R}$). In this approach, $L1_{S1}$ represents a native speaker’s initial shadowing, while $L1_{SS}$ denotes the script-shadowing by the native speaker. By calculating the distance between $L1_{S1}$ and $L1_{SS}$, it is possible to pinpoint the native speaker's listening breakdowns, which correspond to the unintelligible parts in the $L2_{R}$ as well.
  • Figure 3: The similarity of attention alignment to PPG-DTW is illustrated. The left figure shows the attention alignment observed in the inference phase of converting $L2_{R}$ to $L1_{S1}$, while the right figure shows the PPG-DTW path between $L2_{R}$ and $L1_{S1}$. Both figures exhibit a prominent diagonal contour.
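The captions for Figures 2 and 3 rely on aligning two utterances and measuring the distance between them. The sketch below illustrates that idea with plain dynamic time warping (DTW) over per-frame posterior vectors. It is not the authors' implementation: the Euclidean frame distance, the function names `frame_distance`, `dtw_path`, and `local_distances`, and the toy two-class "posteriorgrams" are all assumptions made for illustration.

```python
import math


def frame_distance(p, q):
    """Euclidean distance between two posterior vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dtw_path(x, y):
    """Return (total cost, warping path) between two frame sequences."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance x only
                cost[i][j - 1],      # advance y only
            )
    # Backtrack from the end to recover the alignment path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


def local_distances(x, y, path):
    """Frame distance at each aligned pair; peaks hint at mismatched regions."""
    return [frame_distance(x[i], y[j]) for i, j in path]


# Toy two-class posterior sequences (hypothetical data, not from the paper).
first_shadowing = [[1.0, 0.0], [0.0, 1.0]]
script_shadowing = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

total, path = dtw_path(first_shadowing, script_shadowing)
profile = local_distances(first_shadowing, script_shadowing, path)
```

In this reading, `first_shadowing` and `script_shadowing` stand in for $L1_{S1}$ and $L1_{SS}$, and stretches of the path where `profile` is large would be candidates for listening breakdowns. A near-diagonal `path` corresponds to the diagonal contour described for Figure 3.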