A Chat About Boring Problems: Studying GPT-based text normalization

Yang Zhang; Travis M. Bartley; Mariana Graterol-Fuenmayor; Vitaly Lavrukhin; Evelina Bakhturina; Boris Ginsburg

TL;DR

The paper demonstrates that text normalization, a context-sensitive preprocessor for TTS, can be effectively performed by LLMs in few-shot settings, challenging the view that TN is unsuitable for neural models. By introducing a six-category error taxonomy focused on felicity and unrecoverable errors, and evaluating GPT-3.5-Turbo and GPT-4.0 against a WFST baseline, the authors show that GPT-based TN achieves roughly a $40\%$ reduction in errors compared to Kestrel with near-zero unrecoverable outputs for GPT-4.0. The study emphasizes the importance of self-consistency and comprehensive domain coverage, and reveals that improvements saturate beyond about $2000$ tokens of context. Overall, the findings support the viability of GPT-based TN and point to future hybrid approaches that combine LLM capabilities with domain-specific constraints to further improve reliability and reduce remaining errors.

Abstract

Text normalization, the conversion of text from written to spoken form, is traditionally assumed to be an ill-formed task for language models. In this work, we argue otherwise. We empirically show the capacity of Large Language Models (LLMs) for text normalization in few-shot scenarios. Combining self-consistency reasoning with linguistically informed prompt engineering, we find that LLM-based text normalization achieves error rates around $40\%$ lower than top normalization systems. Further, upon error analysis, we note key limitations in the conventional design of text normalization tasks. We create a new taxonomy of text normalization errors and apply it to results from GPT-3.5-Turbo and GPT-4.0. Through this new framework, we identify strengths and weaknesses of GPT-based TN, opening opportunities for future work.
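The abstract names self-consistency as one ingredient but does not describe the procedure at this point. As a rough illustration only, self-consistency is commonly implemented as a majority vote over several sampled outputs. The sketch below assumes that reading; the function names, the stub sampler, and the sample outputs are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Iterable


def self_consistent_normalize(
    text: str,
    sample_fn: Callable[[str], str],
    n_samples: int = 5,
) -> str:
    """Return the most frequent normalization among n_samples draws.

    sample_fn is a hypothetical stand-in for one stochastic LLM call
    that maps written text to a spoken-form string.
    """
    # Canonicalize whitespace and case so trivially different strings vote together.
    candidates = [sample_fn(text).strip().lower() for _ in range(n_samples)]
    # Counter.most_common breaks ties by first occurrence.
    winner, _count = Counter(candidates).most_common(1)[0]
    return winner


def make_stub_sampler(outputs: Iterable[str]) -> Callable[[str], str]:
    """Build a fake sampler that replays fixed outputs (illustration only)."""
    it = iter(outputs)
    return lambda _text: next(it)


if __name__ == "__main__":
    # Invented outputs for the ambiguous string "1/4", not model results.
    sampler = make_stub_sampler([
        "one quarter cup",
        "one fourth cup",
        "One quarter cup ",
        "january fourth cup",
        "one quarter cup",
    ])
    print(self_consistent_normalize("1/4 cup", sampler, n_samples=5))
```

In practice `sample_fn` would wrap a few-shot prompted model call with a nonzero sampling temperature; the vote then suppresses occasional one-off misreadings of ambiguous tokens.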

Paper Structure (10 sections, 2 figures, 3 tables)

Introduction
Method
Error Taxonomy
Experiments
Dataset
Results
Sampling Experiments
Final Performance
Discussion
Conclusion

Figures (2)

Figure 1: Example of text normalization across semiotic classes. The normalization of the string ("1/4") varies given semantic context.
Figure 2: GPT normalization and evaluation pipeline.
