Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models

Dongrui Han, Mingyu Cui, Jiawen Kang, Xixin Wu, Xunying Liu, Helen Meng

TL;DR

The paper tackles context-dependent ambiguities in Grapheme-to-Phoneme (G2P) conversion for Text-to-Speech by introducing in-context knowledge retrieval (ICKR) with Large Language Models. It explores both a one-shot GPT-4 prompting approach and a GPT-4-based ICKR workflow that uses a homograph dictionary to disambiguate meanings and generate phonemes, complemented by LLMs fine-tuned with QLoRA. On the Librig2p dataset, the ICKR approach improves over the baselines, with a homograph accuracy gain of up to 3.5% absolute and a weighted average PER reduction of up to 2.0% absolute (28.9% relative); the strongest GPT-4-based ICKR configuration reaches 95.7% accuracy and 4.9% PER. The results demonstrate the potential of leveraging LLMs’ linguistic knowledge and context understanding to enhance context-aware G2P for more natural TTS synthesis, while emphasizing the importance of high-quality, customized phoneme dictionaries for best gains.

Abstract

Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping graphemes to their corresponding phonetic representations. However, it suffers from ambiguity: the same grapheme can represent multiple phonemes depending on context, which poses a challenge for G2P conversion. Inspired by the remarkable success of Large Language Models (LLMs) in handling context-aware scenarios, contextual G2P conversion systems that use LLMs' in-context knowledge retrieval (ICKR) capabilities are proposed to improve disambiguation. The efficacy of incorporating ICKR into G2P conversion systems is demonstrated on the Librig2p dataset. In particular, the best contextual G2P conversion system using ICKR outperforms the baseline with a weighted average phoneme error rate (PER) reduction of 2.0% absolute (28.9% relative). Using GPT-4 in the ICKR system increases homograph accuracy by 3.5% absolute (3.8% relative) on the Librig2p dataset.
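
To make the reported metrics concrete, here is a minimal sketch of how a phoneme error rate and the absolute/relative figures relate. It assumes PER is the Levenshtein distance between reference and predicted phoneme sequences divided by the reference length, which is the common definition; the paper's exact weighting scheme for the "weighted average" is not given here and is not reproduced. The function names and the ARPAbet-style example phonemes are illustrative assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def implied_baseline(absolute_change: float, relative_change: float) -> float:
    """Baseline value implied by relative_change = absolute_change / baseline."""
    return absolute_change / relative_change


# Illustrative homograph error: past-tense "read" predicted as present tense.
per = phoneme_error_rate(["R", "EH", "D"], ["R", "IY", "D"])

# Baseline PER (in percent) implied by "2.0% absolute (28.9% relative)".
baseline_per = implied_baseline(2.0, 0.289)
```

Under these assumptions the implied baseline PER is roughly 6.9%, so a 2.0% absolute reduction lands near the 4.9% PER quoted in the TL;DR. The baseline value itself is inferred from the reported percentages, not stated in the text above.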

Paper Structure

This paper contains 12 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: LLM in-context knowledge retrieval for an example sentence: 1. Input the word and its sentence to the system. 2. Tag each word according to whether it is in the dictionary and whether it is a homograph. 3. If the word is in the dictionary and is a homograph, let the LLM find the dictionary case whose usage of the word is closest to the input context. 4. If the word is in the dictionary and is not a homograph, retrieve the recorded phonemes directly. 5. If the word is not in the dictionary, let the LLM generate its phonemes, taking the context into account.
  • Figure 2: LLM one-shot prompt. Only the prompt is shown, without user input.
  • Figure 3: LLM case-matching prompt. Only the prompt is shown, without user input.
  • Figure 4: LLM word phoneme generation prompt. Only the prompt is shown, without user input.
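
The five steps in the Figure 1 caption can be sketched as a simple routing function. This is a rough illustration under stated assumptions, not the authors' implementation: the dictionary layout (word mapped to a list of usage/phoneme entries, with more than one entry marking a homograph), the `match_case` and `generate` callables standing in for the LLM prompts of Figures 3 and 4, and the toy stubs below are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical dictionary layout: word -> list of {"usage", "phonemes"} entries.
# A word with more than one entry is treated as a homograph.
Dictionary = Dict[str, List[Dict[str, str]]]


def ickr_g2p(
    sentence: str,
    dictionary: Dictionary,
    match_case: Callable[[str, str, List[str]], int],
    generate: Callable[[str, str], str],
) -> List[str]:
    """Route each word to lookup, LLM case matching, or LLM generation."""
    output = []
    for word in sentence.lower().split():
        entries = dictionary.get(word)
        if entries is None:
            # Step 5: out-of-dictionary word, ask the LLM for phonemes in context.
            phonemes = generate(word, sentence)
        elif len(entries) > 1:
            # Step 3: homograph, ask the LLM which recorded usage fits the context.
            index = match_case(word, sentence, [e["usage"] for e in entries])
            phonemes = entries[index]["phonemes"]
        else:
            # Step 4: unambiguous dictionary word, use the recorded phonemes.
            phonemes = entries[0]["phonemes"]
        output.append(phonemes)
    return output


# Toy stand-ins for the LLM calls (assumptions for demonstration only).
def stub_match_case(word: str, sentence: str, usages: List[str]) -> int:
    return 1 if "yesterday" in sentence.lower() else 0


def stub_generate(word: str, sentence: str) -> str:
    return f"<gen:{word}>"


toy_dictionary: Dictionary = {
    "i": [{"usage": "pronoun", "phonemes": "AY"}],
    "the": [{"usage": "article", "phonemes": "DH AH"}],
    "read": [
        {"usage": "present tense", "phonemes": "R IY D"},
        {"usage": "past tense", "phonemes": "R EH D"},
    ],
}

result = ickr_g2p("I read the book yesterday", toy_dictionary,
                  stub_match_case, stub_generate)
```

In a real system the two callables would wrap LLM requests built from the prompts in Figures 3 and 4; keeping them as injected functions makes the routing logic testable without any model access.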