Can Generative Large Language Models Perform ASR Error Correction?

Rao Ma, Mengjie Qian, Potsawee Manakul, Mark Gales, Kate Knill

TL;DR

The paper investigates using ChatGPT to perform ASR error correction in zero-shot and 1-shot regimes, leveraging ASR N-best outputs for both unconstrained and constrained decoding. It contrasts unconstrained generation with N-best constrained approaches (selective and closest mapping) across two state-of-the-art ASR architectures and several datasets. Results show that ChatGPT can yield performance gains, particularly with 1-shot prompts and constrained mappings, and can be competitive with T5-based error correction in some settings. The work demonstrates a practical, training-free plug-in approach to ASR post-processing, with notable effectiveness in out-of-domain scenarios but variable results depending on the ASR system and data.

Abstract

ASR error correction is an interesting option for post-processing speech recognition system outputs. These error correction models are usually trained in a supervised fashion using the decoding results of a target ASR system. This approach can be computationally intensive, and the resulting model is tuned to a specific ASR system. Recently, generative large language models (LLMs) have been applied to a wide range of natural language processing tasks, as they can operate in a zero-shot or few-shot fashion. In this paper we investigate using ChatGPT, a generative LLM, for ASR error correction. Based on the ASR N-best output, we propose both unconstrained approaches and constrained approaches, in which a member of the N-best list is selected. Additionally, zero-shot and 1-shot settings are evaluated. Experiments show that this generative LLM approach can yield performance gains for two different state-of-the-art ASR architectures, transducer and attention-encoder-decoder based, and on multiple test sets.
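
To make the constrained "closest mapping" idea concrete, the following is a minimal sketch under one plausible reading: the LLM first generates a free-form correction, and that output is then mapped to the N-best hypothesis with the smallest word-level edit distance. The function names `edit_distance` and `closest_hypothesis`, the lowercasing step, and the tie-breaking rule are assumptions made for illustration, not details taken from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_hypothesis(llm_output, nbest):
    """Return the N-best entry closest to the LLM output.

    Assumption: ties are broken in favour of the higher-ranked hypothesis,
    which follows from min() returning the first minimum it encounters.
    """
    target = llm_output.lower().split()
    return min(nbest, key=lambda hyp: edit_distance(target, hyp.lower().split()))


# Illustrative 3-best list (made-up sentences, not data from the paper).
nbest = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat sat on the mat",
]
corrected = closest_hypothesis("The cat sat on a mat", nbest)
```

Because the final answer is always a member of the N-best list, this kind of mapping cannot introduce words the ASR system never proposed, which is the main appeal of a constrained approach over unconstrained generation.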

Paper Structure (14 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: N-best T5 error correction model structure.
  • Figure 2: Prompt design for (a) zero-shot unconstrained error correction, (b) zero-shot selective approach, and (c) 1-shot unconstrained error correction. Here we use a 3-best list generated by the ASR system as input to ChatGPT for illustration.
  • Figure 3: Baseline WER of Transducer and error correction results with 1-shot closest. The LibriSpeech test set is split into 5 parts according to the number of the closest hypothesis.
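
The three prompt variants named in the Figure 2 caption can be sketched as a small prompt builder. This is only an illustration: the instruction wording, the numbered-list layout, and the names `format_nbest` and `build_prompt` are hypothetical and do not reproduce the prompts used in the paper.

```python
def format_nbest(nbest):
    """Render an N-best list as numbered lines (layout is an assumption)."""
    return "\n".join(f"{i}. {hyp}" for i, hyp in enumerate(nbest, start=1))


def build_prompt(nbest, mode="unconstrained", example=None):
    """Build a prompt for zero-shot or 1-shot ASR error correction.

    mode:    "unconstrained" (free-form correction) or "selective"
             (pick one entry from the N-best list).
    example: optional (example_nbest, example_answer) pair; supplying it
             turns the zero-shot prompt into a 1-shot prompt.
    """
    if mode not in ("unconstrained", "selective"):
        raise ValueError(f"unknown mode: {mode}")

    # Hypothetical instruction text, not the wording from the paper.
    if mode == "unconstrained":
        instruction = ("Correct the speech recognition errors using the "
                       "hypotheses below. Reply with the corrected transcript only.")
    else:
        instruction = ("Choose the most likely transcript from the hypotheses "
                       "below. Reply with the number of your choice only.")

    parts = [instruction]
    if example is not None:
        example_nbest, example_answer = example
        parts.append("Example hypotheses:\n" + format_nbest(example_nbest))
        parts.append("Example answer: " + example_answer)
    parts.append("Hypotheses:\n" + format_nbest(nbest))
    return "\n\n".join(parts)


three_best = ["i scream for ice cream", "eye scream for ice cream", "i scream four ice cream"]
zero_shot = build_prompt(three_best)
selective = build_prompt(three_best, mode="selective")
one_shot = build_prompt(three_best, example=(["a b", "a c", "b c"], "a b"))
```

Keeping the example as an optional argument means the zero-shot and 1-shot settings share a single code path, so any difference in results comes from the added example rather than from changes in the instruction text.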