SIToBI -- A Speech Prosody Annotation Tool for Indian Languages

Preethi Thinakaran, Malarvizhi Muthuramalingam, Sooriya S, Anushiya Rachel Gladston, P. Vijayalakshmi, Hema A Murthy, T. Nagarajan

TL;DR

SIToBI addresses the labor-intensive task of prosody annotation for Indian languages by extending the ToBI framework with time-aligned phoneme, syllable, and word transcriptions, a relative intensity index, break indices, and syllable-level pitch contours. The tool uses language-independent ASR-augmented segmentation via forced-Viterbi alignment and monophone HMMs to enable scalable, multilingual annotation across Tamil, Hindi, and Indian English. It achieves high accuracy in segmentation (≈10–30 ms error), break indices (≈95%), and pitch contour labeling (≈99%), and demonstrates that pitch contours can aid language identification, especially for longer words. Overall, SIToBI provides a scalable framework for multilingual prosody analysis with potential impact on TTS, ASR, and S2S systems in diverse Indian languages and other syllable-timed languages.

Abstract

The availability of prosodic information from speech signals is useful in a wide range of applications. However, deriving this information from speech signals can be a laborious task involving manual intervention. Therefore, the current work focuses on developing a tool that can provide prosodic annotations corresponding to a given speech signal, particularly for Indian languages. The proposed Segmentation with Intensity, Tones and Break Indices (SIToBI) tool provides time-aligned phoneme, syllable, and word transcriptions, syllable-level pitch contour annotations, break indices, and syllable-level relative intensity indices. The tool focuses more on syllable-level annotations since Indian languages are syllable-timed. Indians, regardless of the language they speak, may exhibit influences from other languages. As a result, other languages spoken in India may also exhibit syllable-timed characteristics. The accuracy of the annotations derived from the tool is analyzed by comparing them against manual annotations and the tool is observed to perform well. While the current work focuses on three languages, namely, Tamil, Hindi, and Indian English, the tool can easily be extended to other Indian languages and possibly other syllable-timed languages as well.

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Annotations from the SIToBI Tool for the English sentence, "For the twentieth time that evening the two men shook hands."
  • Figure 2: Block diagram
  • Figure 3: Basic Pitch Contour Shapes Considered: (a) L, (b) H, (c) HHL, (d) LHH, (e) HLL, (f) LLH, (g) HLH, (h) LHL, (i) hat, (j) bucket
  • Figure 4: Comparison of Segmentation Error for Language-Independent and Language-Dependent Models (Tamil, Hindi, English)
  • Figure 5: Confusion Matrices for the Identification of Break Indices: (a) Tamil, (b) English, (c) Hindi
  • ...and 1 more figure