CC-G2PnP: Streaming Grapheme-to-Phoneme and prosody with Conformer-CTC for unsegmented languages
Yuma Shirahata, Ryuichi Yamamoto
TL;DR
CC-G2PnP is a Conformer-CTC model for streaming grapheme-to-phoneme and prosody (G2PnP) prediction in unsegmented languages. It processes the input in chunks and guarantees a minimum look-ahead for each token. The model learns grapheme-phoneme-prosody alignments via CTC without relying on explicit word boundaries, which makes it applicable to languages such as Japanese. Experiments on Japanese data show substantial gains in G2PnP accuracy and TTS naturalness over a baseline streaming model, approaching non-streaming performance on some metrics. The work offers a practical path to connecting LLMs with TTS under streaming constraints. It relies on large-scale training data, and integrating dictionaries or LLM-based knowledge is suggested as future work to further improve performance.
Abstract
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and a text-to-speech system in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimum look-ahead size for each input token, the proposed model can take future context into account for every token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
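To make the idea of chunk-by-chunk processing with a guaranteed minimum look-ahead concrete, the following is a minimal sketch of one way such a visibility mask could be built. It is an illustration under stated assumptions, not the authors' implementation: the function names, the parameters `chunk_size` and `min_lookahead`, and the specific rule (each token sees its whole chunk, and the view is extended when the chunk alone would leave fewer than `min_lookahead` future tokens) are all hypothetical.

```python
def chunk_lookahead_mask(num_tokens, chunk_size, min_lookahead):
    """Build a boolean visibility mask for chunk-wise streaming.

    mask[i][j] is True when token i may use token j as context.
    Assumed rule (hypothetical, for illustration only): token i sees all
    past tokens, every token in its own chunk, and at least
    `min_lookahead` future tokens when that many exist in the sequence.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if min_lookahead < 0:
        raise ValueError("min_lookahead must be non-negative")

    mask = []
    for i in range(num_tokens):
        # Index of the last token in the chunk that contains token i.
        chunk_end = (i // chunk_size + 1) * chunk_size - 1
        # Extend the view past the chunk if the chunk alone would give
        # token i fewer than `min_lookahead` future tokens.
        last_visible = min(max(chunk_end, i + min_lookahead), num_tokens - 1)
        mask.append([j <= last_visible for j in range(num_tokens)])
    return mask


def lookahead_per_token(mask):
    """Count how many future tokens each token is allowed to see."""
    return [sum(row[i + 1:]) for i, row in enumerate(mask)]


if __name__ == "__main__":
    plain = chunk_lookahead_mask(num_tokens=8, chunk_size=4, min_lookahead=0)
    guarded = chunk_lookahead_mask(num_tokens=8, chunk_size=4, min_lookahead=1)
    print(lookahead_per_token(plain))    # last token of each chunk sees no future
    print(lookahead_per_token(guarded))  # chunk-final tokens gain one future token
```

With plain chunking, the last token of every chunk has no future context at all. Under the assumed rule, only the final token of the whole sequence is left without look-ahead, which is the kind of per-token guarantee the abstract describes.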
