Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction

Yuka Ko; Sheng Li; Chao-Han Huck Yang; Tatsuya Kawahara

TL;DR

This work improves Japanese ASR transcripts through generative error correction (GER) with large language models (LLMs). It introduces multi-pass augmented GER (MPA GER), which combines multiple ASR hypotheses with the outputs of several LLMs and merges them, using ROVER-like voting to mitigate hallucinations. The approach is evaluated on SPREDS-U1-ja and CSJ with Elyza-7b and Qwen1.5-7b, and it lowers the character error rate (CER) relative to standard LLM GER and traditional system-combination methods, with notable gains on short utterances. The findings highlight the value of cross-model diversity when post-editing ASR output and suggest that LLM-based GER remains useful in low-CER, high-accuracy regimes for Japanese and potentially other languages.
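Because every result above is reported as CER, a minimal sketch of the metric may help. It assumes the common definition (character-level Levenshtein distance divided by reference length, as a percentage); the paper's exact text normalization is not stated here, and the function names and example sentences are illustrative rather than taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one wrong kanji in a seven-character reference.
print(round(cer("今日は晴れです", "今日は腫れです"), 2))
```

Under this definition a single wrong character costs far more on a short reference than on a long one, which is worth keeping in mind when reading the short-utterance results.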

Abstract

With the strong representational power of large language models (LLMs), generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and phonetic refinements that address ASR errors. This work explores how LLM-based GER can enhance and expand the capabilities of Japanese language processing, presenting the first GER benchmark for Japanese ASR with 0.9-2.6k text utterances. We also introduce a new multi-pass augmented generative error correction (MPA GER) method that integrates multiple system hypotheses on the input side with corrections from multiple LLMs on the output side and then merges them. To the best of our knowledge, this is the first investigation of LLMs for Japanese GER, which involves second-pass language modeling on the output transcriptions generated by the ASR system (e.g., N-best hypotheses). Our experiments show that the proposed methods improve ASR quality and generalization on both the SPREDS-U1-ja and CSJ data.
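The merging step can be illustrated with a deliberately simplified sketch. This is not the authors' implementation and not ROVER proper (which builds a word transition network); it is an assumed character-level variant that aligns each hypothesis to the first one with Python's `difflib.SequenceMatcher` and takes a majority vote per position. Characters inserted relative to the pivot are dropped, and the function names and example sentences are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_pivot(pivot: str, hyp: str) -> list[str]:
    """For each pivot character, return the hypothesis character aligned to it.

    An empty string means the hypothesis has nothing at that position.
    Simplification: hypothesis characters with no pivot counterpart are ignored.
    """
    slots = [""] * len(pivot)
    matcher = SequenceMatcher(a=pivot, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Pair characters one-to-one over the shorter of the two spans.
            for k in range(min(i2 - i1, j2 - j1)):
                slots[i1 + k] = hyp[j1 + k]
        # "delete": slots stay empty; "insert": extra characters are dropped.
    return slots


def vote(hypotheses: list[str]) -> str:
    """Merge hypotheses by per-position majority vote; ties keep the pivot character."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    pivot = hypotheses[0]
    columns = [align_to_pivot(pivot, h) for h in hypotheses]
    merged = []
    for i, pivot_char in enumerate(pivot):
        counts = Counter(col[i] for col in columns)
        best = max(counts, key=lambda c: (counts[c], c == pivot_char))
        merged.append(best)
    return "".join(merged)


# Hypothetical outputs from three ASR or LLM-corrected passes.
print(vote(["今日は腫れです", "今日は晴れです", "今日は晴れです"]))
```

The design point this tries to convey is that an error or hallucination produced by one system is outvoted when the other systems agree, which is why diversity across ASR and LLM outputs matters for the merge.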

Paper Structure (16 sections, 4 figures, 7 tables)

Introduction
Related Work
LMs for ASR task
Improving generative error correction using LLMs for ASR system
LLM GER with $N$-best hypotheses in a single system
Multi-pass augmented (MPA) GER
Experimental Setup
Experimental Results and Analyses
SPREDS-U1-ja: LLM GER in $N$-best hypotheses and $N$-system combination
CSJ Results
LLM GER and MPA GER in 1-best $N$-systems
LLM GER and MPA GER with $N$-best hypotheses in each single system
The GER trends from output examples
Proposed MPA GER alleviates hallucinations or repetitions
Sentence length influences performance
...and 1 more sections

Figures (4)

Figure 1: The standard LLM GER method rescoring $N$-best hypotheses.
Figure 2: The proposed multi-pass augmented (MPA) GER method combines hypotheses from different ASR and LLM models.
Figure 3: Utterance count by reference length in Eval$_{\{1,2,3\}}$.
Figure 4: CER [%] for each reference length in Eval$_3$.
