Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models

Dongrui Han; Mingyu Cui; Jiawen Kang; Xixin Wu; Xunying Liu; Helen Meng

TL;DR

The paper tackles context-dependent ambiguities in Grapheme-to-Phoneme (G2P) conversion for Text-to-Speech (TTS) by introducing in-context knowledge retrieval (ICKR) with Large Language Models (LLMs). It explores both a one-shot GPT-4 prompting approach and a GPT-4-based ICKR workflow that uses a homograph dictionary to disambiguate meanings and generate phonemes, complemented by LLMs fine-tuned with QLoRA. On the Librig2p dataset, the ICKR approach improves over the baselines, with a homograph accuracy gain of up to 3.5% absolute and a weighted average phoneme error rate (PER) reduction of up to 2.0% absolute (28.9% relative); GPT-4-based ICKR reaches 95.7% accuracy and 4.9% PER in the strongest configuration. The results demonstrate the potential of leveraging LLMs' linguistic knowledge and context understanding to enhance context-aware G2P for more natural TTS synthesis, while emphasizing that high-quality, customized phoneme dictionaries are needed for the best gains.
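The dictionary-based workflow summarized above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the paper's implementation: the dictionary entries, the prompt wording, and the names `Sense`, `HOMOGRAPHS`, `build_prompt`, `disambiguate`, and `fake_llm` are all hypothetical, and `fake_llm` is a keyword stand-in for the call that would go to GPT-4 or a fine-tuned LLM.

```python
from typing import Callable, Dict, List, NamedTuple


class Sense(NamedTuple):
    meaning: str   # short gloss shown to the model
    phonemes: str  # ARPAbet-style pronunciation for this sense


# Toy homograph dictionary (illustrative entries, not taken from the paper).
HOMOGRAPHS: Dict[str, List[Sense]] = {
    "lead": [
        Sense("to guide or be in front (verb)", "L IY D"),
        Sense("a heavy metal (noun)", "L EH D"),
    ],
    "bass": [
        Sense("low-pitched sound or instrument", "B EY S"),
        Sense("a kind of fish", "B AE S"),
    ],
}


def build_prompt(sentence: str, word: str, senses: List[Sense]) -> str:
    """Format the retrieved senses as in-context knowledge for a model."""
    options = "\n".join(f"{i}: {s.meaning}" for i, s in enumerate(senses))
    return (
        f"Sentence: {sentence}\n"
        f"Which meaning does '{word}' have here?\n{options}\n"
        "Answer with the number only."
    )


def disambiguate(sentence: str, word: str, ask_llm: Callable[[str], str]) -> str:
    """Retrieve the senses of `word`, let the model pick one, return its phonemes."""
    senses = HOMOGRAPHS[word.lower()]
    reply = ask_llm(build_prompt(sentence, word, senses))
    return senses[int(reply.strip())].phonemes


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call: a crude keyword rule."""
    sentence_line = prompt.splitlines()[0].lower()
    return "1" if any(k in sentence_line for k in ("caught", "lake", "pipe")) else "0"


print(disambiguate("He caught a bass in the lake.", "bass", fake_llm))
print(disambiguate("She plays bass in a band.", "bass", fake_llm))
```

Passing the model as a callable keeps the retrieval step (dictionary lookup) separate from the disambiguation step, so the stand-in can be swapped for a real LLM client without touching the rest of the sketch.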

Abstract

Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping graphemes to their corresponding phonetic representations. However, it suffers from ambiguity: the same grapheme can represent different phonemes depending on context. Inspired by the remarkable success of Large Language Models (LLMs) in handling context-aware scenarios, contextual G2P conversion systems that use LLMs' in-context knowledge retrieval (ICKR) capabilities are proposed to improve disambiguation. The efficacy of incorporating ICKR into G2P conversion systems is demonstrated on the Librig2p dataset. In particular, the best contextual G2P conversion system using ICKR outperforms the baseline with a weighted average phoneme error rate (PER) reduction of 2.0% absolute (28.9% relative). Using GPT-4 in the ICKR system increases homograph accuracy by 3.5% absolute (3.8% relative) on the Librig2p dataset.
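To make the reported figures concrete, the following sketch shows one common way to compute PER (edit distance over phoneme sequences, divided by the number of reference phonemes) and how an absolute reduction relates to a relative one. It is an assumption-laden illustration: the function names are hypothetical, the toy sequences are not from Librig2p, and the weighting behind the paper's "weighted average" is not described in this section, so it is not reproduced here.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def per(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Corpus-level PER in percent: total edits / total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


def relative_reduction(baseline: float, system: float) -> float:
    """Relative reduction in percent, given two error rates in percent."""
    return 100.0 * (baseline - system) / baseline


# Toy check: one wrong vowel out of three reference phonemes.
print(per([["R", "IY", "D"]], [["R", "EH", "D"]]))

# A 2.0-point absolute drop that equals 28.9% relative implies a baseline
# PER of roughly 2.0 / 0.289 (approximate, since the inputs are rounded).
implied_baseline = 2.0 / 0.289
print(round(implied_baseline, 2))
```

The implied baseline of about 6.9% PER, minus the 2.0-point reduction, is consistent with the 4.9% PER quoted for the strongest configuration, up to rounding.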
