PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech

Michel Wong, Ali Alshehri, Sophia Kao, Haotian He

TL;DR

PolyNorm is a scalable, language-agnostic text normalization (TN) framework for text-to-speech (TTS) that uses few-shot prompting and in-context learning to reduce reliance on hand-crafted rules. Using a multilingual benchmark (PolyNorm-Benchmark) and an automatic data-curation pipeline, it achieves strong WER/BLEU results across eight languages, outperforming a production rule-based baseline and adapting to domain-specific constructs such as URLs and numeric expressions. The approach emphasizes rapid iteration, reduced labeling needs, and broad language coverage, and outlines future work on diacritization, non-whitespace tokenization, and suprasegmental features. The work advances practical, multilingual TN for TTS and provides standardized evaluation data to spur further research in LLM-driven normalization.

Abstract

Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but they involve substantial engineering effort, are difficult to scale, and pose challenges to language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in word error rate (WER) compared to a production-grade system. To support further research, we release PolyNorm-Benchmark, a multilingual dataset covering a diverse range of text normalization phenomena.
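
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of word error rate: the word-level edit distance between a reference normalization and a system output, divided by the number of reference words. It is not the paper's evaluation code. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration; whitespace splitting in particular would not suit languages written without spaces.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace (illustrative only).
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,                      # deletion
                    current_row[j - 1] + 1,                   # insertion
                    previous_row[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# Hypothetical example: one substituted word out of five reference words.
reference = "it is seventeen degrees outside"
hypothesis = "it is seventeen degree outside"
print(word_error_rate(reference, hypothesis))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 1.0; lower values indicate output closer to the reference.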

Paper Structure

This paper contains 21 sections, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: The ambiguous token 17° is normalized differently based on context and language.