LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context

Natsuo Yamashita, Masaaki Yamamoto, Hiroaki Kokubo, Yohei Kawaguchi

TL;DR

This work tackles two weaknesses of LLM-based generative error correction (GER) for ASR: poor handling of rare words, and over-correction caused by ignoring phonetic information. It takes a two-pronged approach: generating synthetic data that contains rare words for GER fine-tuning, and incorporating phonetic context through N-best hypotheses and a simplified phoneme representation (LSP) to curb over-correction. Results on English and Japanese datasets show consistent WER/CER reductions and substantial improvements in rare-word recall, with LSP providing additional gains by balancing semantic and phonetic cues. The method offers a scalable path to more robust GER that stays closer to what was actually spoken across diverse domains.

Abstract

Generative error correction (GER) with large language models (LLMs) has emerged as an effective post-processing approach to improve automatic speech recognition (ASR) performance. However, it often struggles with rare or domain-specific words due to limited training data. Furthermore, existing LLM-based GER approaches primarily rely on textual information, neglecting phonetic cues, which leads to over-correction. To address these issues, we propose a novel LLM-based GER approach that targets rare words and incorporates phonetic information. First, we generate synthetic data to contain rare words for fine-tuning the GER model. Second, we integrate ASR's N-best hypotheses along with phonetic context to mitigate over-correction. Experimental results show that our method not only improves the correction of rare words but also reduces the WER and CER across both English and Japanese datasets.
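As a rough illustration of the second idea, the sketch below assembles a correction prompt from N-best hypotheses paired with phoneme strings. It is not the authors' prompt format: the `Hypothesis` type, the `build_ger_prompt` function, the prompt wording, and the hand-written phoneme strings are all assumptions made for this example, and the phonemes are expected to come from whatever grapheme-to-phoneme tool the caller uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One ASR hypothesis plus a phonetic rendering (both hypothetical fields)."""
    text: str
    phonemes: str  # supplied by the caller, e.g. from a G2P tool


def build_ger_prompt(n_best: List[Hypothesis], max_hyps: int = 5) -> str:
    """Format the top hypotheses and their phonemes as one text prompt.

    The wording is illustrative only; the paper's actual prompt is not
    given in this section.
    """
    if not n_best:
        raise ValueError("need at least one ASR hypothesis")
    lines = [
        "Below are N-best ASR hypotheses, each with its phoneme sequence.",
        "Return the most likely transcription and stay phonetically close to them.",
        "",
    ]
    # List hypotheses in rank order, keeping at most max_hyps of them.
    for rank, hyp in enumerate(n_best[:max_hyps], start=1):
        lines.append(f"{rank}. text: {hyp.text}")
        lines.append(f"   phonemes: {hyp.phonemes}")
    lines.append("")
    lines.append("Corrected transcription:")
    return "\n".join(lines)


# Illustrative input: two hypotheses that sound alike but differ in spelling.
# The phoneme strings are hand-written placeholders, not dictionary output.
example = [
    Hypothesis("the dose of a moxicillin", "DH AH D OW S AH V AH M AA K S AH S IH L AH N"),
    Hypothesis("the dose of amoxicillin", "DH AH D OW S AH V AH M AA K S AH S IH L AH N"),
]
prompt = build_ger_prompt(example)
```

Showing the phoneme sequence next to each hypothesis gives the model an explicit signal that a candidate correction should still sound like what the recognizer heard, which is the intuition behind using phonetic context to limit over-correction.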

Paper Structure

This paper contains 18 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the proposed methods of synthetic data generation from rare words and GER with phonetic context.
  • Figure 2: F1 scores of rare words with different numbers of transcripts and speakers in the Medtxt dataset using ChatGPT.