CC-G2PnP: Streaming Grapheme-to-Phoneme and prosody with Conformer-CTC for unsegmented languages

Yuma Shirahata, Ryuichi Yamamoto

TL;DR

CC-G2PnP addresses streaming grapheme-to-phoneme and prosody prediction for unsegmented languages by introducing a Conformer-CTC-based model that operates in a streaming fashion with chunk-aware processing and a minimum look-ahead mechanism. The method learns grapheme-phoneme-prosody alignments via CTC without relying on explicit word boundaries, enabling applicability to languages like Japanese. Experimental results on Japanese data show substantial gains in G2PnP accuracy and TTS naturalness over a baseline streaming model, approaching non-streaming performance in some metrics. The work demonstrates a practical pathway to integrating LLMs with TTS under streaming constraints, though it relies on large-scale training data and suggests future integration of dictionaries or LLM-based knowledge to further enhance performance.

Abstract

We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects large language models and text-to-speech systems in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimum look-ahead size for each input token, the proposed model can take future context into account for every token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP effectively learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
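As a rough illustration of how a CTC output is turned into a label sequence, the sketch below applies the standard greedy CTC collapse rule: merge consecutive repeats, then drop blanks. It is not taken from the paper; the blank symbol `_`, the function name `ctc_greedy_collapse`, and the example frame labels are all hypothetical, and the paper's actual decoding procedure may differ.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen only for this example


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame CTC labels: merge consecutive repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Hypothetical per-frame argmax labels. The blank between the two "n" runs
# is what keeps them as two separate output symbols.
frames = ["k", "k", "_", "o", "_", "_", "n", "n", "_", "n", "i"]
print(ctc_greedy_collapse(frames))
```

Because the collapse is a left-to-right operation, it can in principle be applied incrementally as each chunk of frames arrives, as long as the last emitted label is carried over between chunks.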

Paper Structure (15 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Proposed model architecture. The model takes grapheme tokens as input and predicts a mixed sequence of phoneme and prosodic symbols.
  • Figure 2: Chunk-aware streaming with chunk size $C=3$ and past context size $P=3$. The green token can attend to the pink tokens, i.e. the tokens within its own chunk and the past $P$ context tokens.
  • Figure 3: Minimum look-ahead (MLA) with MLA size $M=1$. MLA allows the first self-attention layer to reference future tokens outside the current chunk, ensuring that every token has at least one token of look-ahead.
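The attention patterns described in the captions for Figures 2 and 3 can be sketched as a boolean mask. This is a minimal illustration, not the authors' implementation. The function name `chunk_attention_mask` and its parameters are hypothetical. It assumes that the past context of $P$ tokens is counted back from the start of the current chunk, and that MLA extends the right edge of the visible window to at least $M$ tokens past the current position, clipped at the end of the sequence.

```python
def chunk_attention_mask(n, chunk_size, past_size, mla_size=0):
    """Return an n x n mask; mask[i][j] is True if token i may attend to token j.

    Assumptions (not confirmed by the source): the past context is measured
    from the start of token i's chunk, and MLA extends the right edge to at
    least i + mla_size.
    """
    mask = []
    for i in range(n):
        chunk_start = (i // chunk_size) * chunk_size
        chunk_end = min(chunk_start + chunk_size, n) - 1  # inclusive index
        left = max(chunk_start - past_size, 0)
        # Without MLA the window stops at the chunk end; with MLA every token
        # sees at least mla_size future tokens when they exist.
        right = min(max(chunk_end, i + mla_size), n - 1)
        mask.append([left <= j <= right for j in range(n)])
    return mask


def show(mask):
    """Render a mask as text, 'x' for visible and '.' for masked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


# Settings from the captions: C=3, P=3; MLA size M=1 for the first layer only.
plain = chunk_attention_mask(9, chunk_size=3, past_size=3)
first_layer = chunk_attention_mask(9, chunk_size=3, past_size=3, mla_size=1)
print(show(plain))
print()
print(show(first_layer))
```

In this sketch the only rows that change under MLA are the last tokens of each non-final chunk (positions 2 and 5), which would otherwise have no future context at all.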