Synthetic Data Domain Adaptation for ASR via LLM-based Text and Phonetic Respelling Augmentation

Natsuo Yamashita; Koichi Nagatsuka; Hiroaki Kokubo; Kota Dohi; Tuan Vu Ho

Abstract

End-to-end automatic speech recognition often degrades on domain-specific data due to scarce in-domain resources. We propose a synthetic-data-based domain adaptation framework with two contributions: (1) a large language model (LLM)-based text augmentation pipeline with a filtering strategy that balances lexical diversity, perplexity, and domain-term coverage, and (2) phonetic respelling augmentation (PRA), a novel method that introduces pronunciation variability through LLM-generated orthographic pseudo-spellings. Unlike conventional acoustic-level methods such as SpecAugment, PRA provides phonetic diversity before speech synthesis, enabling synthetic speech to better approximate real-world variability. Experimental results across four domain-specific datasets demonstrate consistent reductions in word error rate, confirming that combining domain-specific lexical coverage with realistic pronunciation variation significantly improves ASR robustness.
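The abstract names three filtering criteria (lexical diversity, perplexity, and domain-term coverage) but does not give the exact scoring rule. The sketch below is therefore a hypothetical illustration, not the authors' method. It combines the three criteria into a weighted score and greedily keeps sentences until a word budget is spent. The function names `score_sentence` and `greedy_filter`, the weights `w_div`, `w_ppl`, `w_term`, and the `budget_words` parameter are all assumptions. The word budget loosely stands in for the "filtered duration" and the weights for the "weight ratio" varied in Figure 2. `perplexity` is any callable returning a positive language-model perplexity.

```python
import math
from typing import Callable, Iterable


def score_sentence(
    sentence: str,
    domain_terms: set[str],
    seen_words: set[str],
    perplexity: Callable[[str], float],
    w_div: float = 1.0,
    w_ppl: float = 1.0,
    w_term: float = 1.0,
) -> float:
    """Hypothetical weighted filtering score; higher is better."""
    words = sentence.lower().split()
    if not words:
        return float("-inf")
    # Lexical diversity: share of words not yet present in the selected pool.
    novelty = sum(w not in seen_words for w in words) / len(words)
    # Domain-term coverage: share of words that are domain-specific terms.
    coverage = sum(w in domain_terms for w in words) / len(words)
    # Fluency: lower perplexity is better, so log-perplexity is subtracted.
    log_ppl = math.log(perplexity(sentence))
    return w_div * novelty + w_term * coverage - w_ppl * log_ppl


def greedy_filter(
    candidates: Iterable[str],
    domain_terms: set[str],
    perplexity: Callable[[str], float],
    budget_words: int,
    **weights: float,
) -> list[str]:
    """Repeatedly take the best-scoring sentence that still fits the budget."""
    # Drop exact duplicates and empty lines while keeping the original order.
    pool = [s for s in dict.fromkeys(candidates) if s.split()]
    selected: list[str] = []
    seen: set[str] = set()
    used = 0
    while pool:
        best = max(
            pool,
            key=lambda s: score_sentence(s, domain_terms, seen, perplexity, **weights),
        )
        pool.remove(best)
        n_words = len(best.split())
        if used + n_words > budget_words:
            continue  # does not fit; shorter candidates may still fit
        selected.append(best)
        seen.update(best.lower().split())
        used += n_words
    return selected
```

Because novelty is measured against the text already selected, a sentence that repeats selected vocabulary loses priority on later rounds. A one-shot ranking would not capture that interaction.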

Paper Structure (13 sections, 5 figures, 4 tables)

Introduction
Related work
Text filtering for ASR domain adaptation
Phonological Tasks and LLMs
Proposed method
LLM-based text augmentation pipeline
Phonetic respelling augmentation
Experiments
Experimental setup
Preliminary results of text augmentation pipeline
Main results of ASR performance
Ablation Study
Conclusion

Figures (5)

Figure 1: Proposed methods overview. Italics denote placeholders.
Figure 1: Summary of evaluation subsets and generated data. Term Ratio is the proportion of test words that are domain-specific terms.
Figure 2: WER with varied weight ratio and filtered duration in P1-1 on ATCO2.
Figure 3: ASR results for text augmentation methods (WER / B-WER / U-WER). "P" indicates proposed methods; "B" indicates baselines.
Figure 4: WER of P1-1 with SpecAugment and PRA.
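The Term Ratio in the evaluation-subset summary is defined as the proportion of test words that are domain-specific terms. The sketch below computes it under two assumptions not stated in the source: words are split on whitespace and lowercased, and every domain term is a single word. Multi-word terms would need phrase matching instead. The function name `term_ratio` is hypothetical.

```python
def term_ratio(test_transcripts: list[str], domain_terms: set[str]) -> float:
    """Fraction of test words that are domain-specific terms (single-word terms only)."""
    # Flatten all reference transcripts into one lowercased word list.
    words = [w for line in test_transcripts for w in line.lower().split()]
    if not words:
        return 0.0
    # Count word tokens, not unique types, so repeated terms count each time.
    return sum(w in domain_terms for w in words) / len(words)
```

For example, the transcripts "cleared runway two seven" and "squawk seven" contain six words, two of which ("runway", "squawk") are terms, giving a ratio of 1/3.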