LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context
Natsuo Yamashita, Masaaki Yamamoto, Hiroaki Kokubo, Yohei Kawaguchi
TL;DR
This work addresses two weaknesses of LLM-based generative error correction (GER) for ASR: poor handling of rare words, and over-correction caused by ignoring phonetic information. It takes a two-part approach: generating synthetic data that contains rare words for GER fine-tuning, and supplying phonetic context through N-best hypotheses and a simplified phoneme representation (LSP) to curb over-correction. Experiments on English and Japanese datasets show consistent WER/CER reductions and substantial improvements in rare-word recall, with LSP providing additional gains by balancing semantic and phonetic cues. The method offers a scalable path to more robust GER that better preserves what was actually spoken across diverse domains.
Abstract
Generative error correction (GER) with large language models (LLMs) has emerged as an effective post-processing approach to improve automatic speech recognition (ASR) performance. However, it often struggles with rare or domain-specific words due to limited training data. Furthermore, existing LLM-based GER approaches rely primarily on textual information and neglect phonetic cues, which leads to over-correction. To address these issues, we propose a novel LLM-based GER approach that targets rare words and incorporates phonetic information. First, we generate synthetic data containing rare words for fine-tuning the GER model. Second, we integrate the ASR system's N-best hypotheses along with phonetic context to mitigate over-correction. Experimental results show that our method not only improves the correction of rare words but also reduces the word error rate (WER) and character error rate (CER) on both English and Japanese datasets.
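To make the second component more concrete, the following is a minimal sketch of how N-best hypotheses and their phonetic renderings might be assembled into a single input for a GER model. It is not the authors' prompt format: the `Hypothesis` class, the `build_ger_prompt` function, the prompt wording, and the phoneme strings are all hypothetical, and the phonetic rendering is assumed to be produced elsewhere (the abstract does not specify how).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One entry of an ASR N-best list (hypothetical container)."""
    text: str      # hypothesis text produced by the ASR system
    phonemes: str  # phonetic rendering of the text, assumed to be given


def build_ger_prompt(hypotheses: List[Hypothesis], use_phonetics: bool = True) -> str:
    """Assemble an illustrative GER prompt from an N-best list.

    The wording of the prompt is an assumption made for illustration only.
    """
    lines = ["Correct the ASR output using the N-best hypotheses below."]
    for rank, hyp in enumerate(hypotheses, start=1):
        lines.append(f"Hypothesis {rank}: {hyp.text}")
        if use_phonetics:
            # Phonetic context sits next to each hypothesis so the model can
            # prefer corrections that still sound like what was spoken.
            lines.append(f"Phonemes {rank}: {hyp.phonemes}")
    lines.append("Corrected transcription:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up 2-best list; the phoneme strings are illustrative placeholders.
    n_best = [
        Hypothesis("the patient was given court is all", "K AO R T IH Z AO L"),
        Hypothesis("the patient was given cortisol", "K AO R T IH S AO L"),
    ]
    print(build_ger_prompt(n_best))
```

Setting `use_phonetics=False` yields a text-only prompt, which is one simple way to compare a purely textual GER input against one that carries phonetic context.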
