A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings

Anindita Mondal, Rangavajjala Sankara Bharadwaj, Jhansi Mallela, Anil Kumar Vuppala, Chiranjeevi Yarra

TL;DR

This work investigates whether prosody embeddings learned by FastSpeech2, specifically via its variance adaptor, can support word- and syllable-level prominence detection in native and non-native English. It compares two extraction modes (text-only and speech-plus-text) using energy, duration, and pitch embeddings, and benchmarks them against heuristic-based features and self-supervised Wav2Vec-2.0 representations on native (Tatoeba) and non-native (ISLE) corpora. The findings show that energy-based embeddings, particularly from the speech-plus-text path, improve prominence discrimination, with relative accuracy gains of up to 13.7% (word level) and 5.9% (syllable level) over heuristic-based features, and up to 16.2% and 6.9% over Wav2Vec-2.0, and with stronger performance for German than for Italian speakers. These results support the utility of TTS-derived prosody embeddings for computer-assisted language learning (CALL) and non-native speech analysis, while highlighting language-specific effects and epoch-dependent improvements in embedding quality.

Abstract

Automatic detection of prominence at the word and syllable levels is critical for building computer-assisted language learning systems. It has been shown that prosody embeddings learned by current state-of-the-art (SOTA) text-to-speech (TTS) systems can generate word- and syllable-level prominence in synthesized speech that is as natural as in native speech. To understand the effectiveness of TTS prosody embeddings for prominence detection in a non-native context, a comparative analysis is conducted on embeddings extracted from native and non-native speech, considering the prominence-related embeddings (duration, energy, and pitch) from a SOTA TTS system named FastSpeech2. These embeddings are extracted under two conditions: 1) text only, and 2) both speech and text. For the first condition, the embeddings are extracted directly from the TTS in inference mode, whereas for the second condition, we propose to extract them from the TTS in training mode. Experiments are conducted on a native speech corpus, Tatoeba, and a non-native speech corpus, ISLE. For experimentation, word-level prominence locations are manually annotated for both corpora. The highest relative improvements in word- and syllable-level prominence detection accuracy with the TTS embeddings are found to be 13.7% and 5.9% over heuristic-based features, and 16.2% and 6.9% over self-supervised Wav2Vec-2.0 representations, respectively.
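
The gains quoted above are relative rather than absolute improvements. As a minimal sketch of how such a figure is usually computed, assuming the common definition (new accuracy minus baseline accuracy, divided by baseline accuracy), the helper below may be useful. The function name and the example accuracies are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Return the relative improvement of new_acc over baseline_acc, in percent.

    Assumes the common definition 100 * (new - baseline) / baseline;
    the paper's exact formula is not stated in this summary.
    """
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (new_acc - baseline_acc) / baseline_acc


if __name__ == "__main__":
    # Hypothetical accuracies, for illustration only (not results from the paper).
    baseline = 50.0   # e.g. accuracy with a baseline feature set
    candidate = 55.0  # e.g. accuracy with another feature set
    print(f"relative improvement: {relative_improvement(baseline, candidate):.1f}%")
```

Note that a relative gain depends on the baseline: the same absolute gain of 5 points is a larger relative improvement over a weaker baseline than over a stronger one.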

Paper Structure

This paper contains 12 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Block diagram showing the proposed approach for obtaining embeddings from the FastSpeech2 variance adaptor.
  • Figure 2: Scatterplots comparing two principal components under the Speech+Text and Text-only cases ('N' denotes native, 'NN' non-native, 'W' word-level, 'S' syllable-level, 'E' energy, 'D' duration, and 'P' pitch).
  • Figure 3: Comparison of distance metrics for native and non-native speech under the Text-only and Speech+Text cases (the distances enclosed within the box indicate similarity measures).
  • Figure 4: Comparison of epoch-wise accuracies with K-Means and DNN classification for GER and ITA, considering word-level and syllable-level prominence detection.
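
Figure 4 refers to K-Means classification of prominence. As a rough illustration of that idea, the sketch below clusters embedding vectors into two groups and scores the clustering against binary prominence labels. It is a toy under stated assumptions, not the authors' setup: the function names, the farthest-point initialization, the two-dimensional toy vectors, and the rule of scoring under the better of the two cluster-to-label mappings are all choices made here for illustration.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points: List[Sequence[float]], iters: int = 20) -> List[int]:
    """Cluster points into two groups with plain K-Means (k = 2).

    Initialization (an assumption of this sketch): the first point and
    the point farthest from it serve as the starting centroids.
    """
    c0 = list(points[0])
    c1 = list(max(points, key=lambda p: sq_dist(p, c0)))
    centroids = [c0, c1]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, new_labels) if lab == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


def clustering_accuracy(labels: List[int], truth: List[int]) -> float:
    """Accuracy under the better of the two cluster-to-class mappings."""
    agree = sum(l == t for l, t in zip(labels, truth)) / len(truth)
    return max(agree, 1.0 - agree)


if __name__ == "__main__":
    # Hypothetical 2-D "embeddings" with made-up prominence labels.
    toy_embeddings = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
                      [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
    toy_truth = [0, 0, 0, 1, 1, 1]
    predicted = two_means(toy_embeddings)
    print("accuracy:", clustering_accuracy(predicted, toy_truth))
```

Because K-Means is unsupervised, cluster indices carry no inherent meaning, which is why the accuracy helper checks both possible mappings from clusters to the prominent and non-prominent classes.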