Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech

Abhinav Garg, Jiyeon Kim, Sushil Khyalia, Chanwoo Kim, Dhananjaya Gowda

TL;DR

The paper tackles the reliance on hand-crafted lexicons and fixed phoneme sets in G2P by proposing a data-driven, lexicon-free approach that derives phoneme representations from unlabeled speech via self-supervised learning. It leverages HuBERT acoustic units and k-means clustering to create frame-level phoneme targets, trains a Transformer-based G2P on these targets, and finally trains Tacotron 2 for TTS without a linguistic lexicon. Results show MOS comparable to or better than lexicon-based methods, with high GMOS when using HuBERT-derived targets, demonstrating strong performance without linguistic expertise and offering a scalable path for low-resource languages. Overall, the method reduces the dependence on expert lexicons while delivering high-quality TTS, highlighting practical impact for broad language coverage.
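
The clustering step described above can be illustrated with a minimal sketch. It assumes frame-level feature vectors are already available; here random synthetic vectors stand in for HuBERT outputs, and the function name `kmeans_units`, the feature dimension, and the number of clusters are hypothetical choices for illustration, not values from the paper. It shows only how k-means turns continuous frames into discrete unit IDs that could serve as frame-level targets.

```python
import numpy as np


def kmeans_units(frames, k, n_iter=20, seed=0):
    """Cluster frame-level features with plain Lloyd's k-means.

    Returns (unit_ids, centroids): one discrete unit ID per frame,
    plus the k centroid vectors.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames.
    centroids = frames[rng.choice(len(frames), size=k, replace=False)].copy()

    def assign(c):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((frames[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        unit_ids = assign(centroids)
        for j in range(k):
            members = frames[unit_ids == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    # Final assignment against the final centroids.
    return assign(centroids), centroids


# Synthetic stand-in for HuBERT features: 90 frames, 4 dimensions (assumed sizes).
data_rng = np.random.default_rng(1)
centers = np.array([[0.0] * 4, [5.0] * 4, [-5.0] * 4])
frames = np.concatenate(
    [c + 0.1 * data_rng.standard_normal((30, 4)) for c in centers]
)

unit_ids, centroids = kmeans_units(frames, k=3)
print(unit_ids)
```

In this sketch each frame receives the ID of its nearest centroid, so an utterance becomes a sequence of discrete units. How those units are post-processed before G2P training is not specified in this summary and is left out here.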

Abstract

Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually ARPABET or IPA, which might not be the optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both of these issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. Furthermore, we show that our data-driven lexicon-free method performs as well as, or even marginally better than, the conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no prior language lexicon or phoneme set, i.e. no linguistic expertise.

Paper Structure (15 sections, 1 equation, 1 figure, 3 tables)