Phonetically rich corpus construction for a low-resourced language

Marcellus Amadeus; William Alberto Cruz Castañeda; Wilmer Lobato; Niasche Aquino

TL;DR

This work tackles the challenge of creating a phonetically rich corpus for Brazilian Portuguese (PT-BR), a low-resource language, to improve acoustic modeling for TTS and ASR applications. It introduces a four-stage methodology (establishing linguistic variables, collecting diverse corpora, textual processing, and recording protocols) built around a phoneme-to-triphone framework with an acoustic-articulatory (vocoid/contoid) classification. The resulting corpus (Alana AI) has substantially higher distinct-triphone coverage than CETUC, TTS-Portuguese, and Globo: at comparable sample sizes it reaches a $55.8\%$ higher percentage of distinct triphones relative to a non-phonetically rich dataset, and coverage saturates at around $5{,}000$ sentences when maximizing linguistic diversity. The work provides a concrete, replicable workflow for constructing phonetically rich data in low-resource settings, with implications for the naturalness and accuracy of PT-BR speech technologies in both ASR and TTS tasks.

Abstract

Speech technologies rely on capturing a speaker's voice variability while obtaining comprehensive language information. Textual prompts and sentence selection methods have been proposed in the literature to assemble such phonetic data, referred to as a phonetically rich \textit{corpus}. However, they are still insufficient for acoustic modeling, a shortcoming that is especially critical for languages with limited resources. Hence, this paper proposes a novel approach and outlines the methodological aspects required to create a \textit{corpus} with broad phonetic coverage for a low-resourced language, Brazilian Portuguese. Our methodology spans every step from text dataset collection to a sentence selection algorithm based on triphone distribution. Furthermore, we propose a new phonemic classification according to acoustic-articulatory speech features, since the absolute number of distinct triphones, or of low-probability triphones, does not guarantee an adequate representation of every possible combination. Using our algorithm, we achieve a 55.8\% higher percentage of distinct triphones, for samples of similar size, in comparison to a non-phonetically rich dataset, whereas the currently available phonetically rich corpora, CETUC and TTS-Portuguese, achieve 12.6\% and 12.3\%, respectively.
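
To make the idea of triphone-driven sentence selection concrete, the following is a minimal sketch, not the authors' algorithm. It uses a generic greedy coverage heuristic that repeatedly picks the sentence contributing the most triphones not yet covered. The names `triphones` and `greedy_select`, the `budget` parameter, and the toy phoneme sequences are all hypothetical; in practice the phoneme sequences would come from a grapheme-to-phoneme step that is not shown here.

```python
from typing import List, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the distinct triphones (sliding windows of three phones)."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def greedy_select(sentences: List[List[str]], budget: int) -> List[int]:
    """Pick up to `budget` sentence indices by greedy triphone coverage.

    Assumption: this is a generic set-cover heuristic used for illustration,
    not the selection procedure described in the paper.
    """
    covered: Set[Triphone] = set()
    chosen: List[int] = []
    remaining = set(range(len(sentences)))
    while remaining and len(chosen) < budget:
        # Ties are broken by the lowest sentence index.
        best = max(sorted(remaining),
                   key=lambda i: len(triphones(sentences[i]) - covered))
        gain = triphones(sentences[best]) - covered
        if not gain:
            break  # no remaining sentence adds a new triphone
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical toy input: each sentence is already a list of phoneme symbols.
toy_sentences = [
    ["k", "a", "z", "a"],
    ["k", "a", "z", "a", "s"],
    ["m", "a", "l", "a"],
]
selected = greedy_select(toy_sentences, budget=2)
```

Stopping as soon as no sentence adds a new triphone mirrors the saturation behaviour mentioned in the TL;DR, although the actual stopping criterion used in the paper is not specified in this section.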

Paper Structure (14 sections, 5 figures, 6 tables)


Introduction
Initial guidelines
Methodology
Establishing linguistic variables
Textual corpora selection
Exclusion criteria
Textual processing
Corpora size selection
Sentences selection
Protocols: speakers and recordings
Discussion and results
Conclusion
Acknowledgements
Bibliographical References

Figures (5)

Figure 1: Methodology for building a phonetically rich text corpus for PT-BR.
Figure 2: Phonemic histogram of the CETUC corpus text by considering a vocoid-contoid triphone classification.
Figure 3: Phonemic histogram of the CETUC corpus text by considering a vocoid-contoid triphone classification.
Figure 4: Changepoint analysis for the variance of new triphones per sentence.
Figure 5: Corpora new triphones per sentence.
