Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition

Yuchun Shu, Bo Hu, Yifeng He, Hao Shi, Longbiao Wang, Jianwu Dang

TL;DR

The paper addresses ASR error correction by proposing a non-autoregressive framework that leverages both an acoustic reference from the ASR encoder and per-token confidence estimates to locate and repair errors. It introduces a Confidence Module (CEM) and an N-best fusion strategy to integrate multiple ASR hypotheses, using cross-attention to fuse references with hypothesis embeddings. Empirical results on AISHELL-1 show substantial CER reductions (up to ~21%) compared with the ASR baseline, while maintaining fast, non-autoregressive decoding. The approach demonstrates the complementary benefits of acoustic and confidence references and supports practical deployment through low latency and robust error correction. Future work aims to handle variable-length corrections for deletions more effectively.

Abstract

The goal of speech error correction is to accurately locate the wrong words in an automatic speech recognition (ASR) hypothesis and recover them in a well-founded way. In this paper, we propose a non-autoregressive speech error correction method. A Confidence Module measures the uncertainty of each word in the N-best ASR hypotheses, which serves as a reference for locating the positions of wrong words. In addition, the acoustic features from the ASR encoder provide a reference for the correct pronunciation. The N-best candidates from ASR are aligned using the edit path so that they can confirm each other and recover some missing characters. Furthermore, a cross-attention mechanism fuses the error correction references with the ASR hypothesis. The experimental results show that both the acoustic and confidence references help with error correction. The proposed system reduces the error rate by 21% compared with the ASR model.
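
The abstract mentions aligning N-best candidates along the edit path so that one hypothesis can expose characters missing from another. The following is a minimal sketch of that idea, assuming a plain Levenshtein backtrace over characters; the function name edit_path_align and the BLANK placeholder are hypothetical, and the paper's actual alignment procedure may differ.

```python
from typing import List, Tuple

BLANK = "<blank>"  # hypothetical placeholder marking an alignment gap


def edit_path_align(ref: str, hyp: str) -> Tuple[List[str], List[str]]:
    """Align two character sequences along a minimum-edit-distance path.

    Returns two equal-length lists; BLANK marks a position where one
    sequence has no counterpart in the other.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost,  # match or substitution
                dp[i - 1][j] + 1,         # hyp lacks ref[i-1]
                dp[i][j - 1] + 1,         # hyp has an extra character
            )

    # Backtrace from the bottom-right corner to recover one optimal path.
    a: List[str] = []
    b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            a.append(ref[i - 1])
            b.append(hyp[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            a.append(ref[i - 1])
            b.append(BLANK)
            i -= 1
        else:
            a.append(BLANK)
            b.append(hyp[j - 1])
            j -= 1
    return a[::-1], b[::-1]
```

For example, calling edit_path_align("abcd", "abd") pairs "c" in the first sequence with BLANK in the second, which is the kind of gap a second hypothesis could fill when the first one has dropped a character.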

Paper Structure (15 sections, 3 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Flowchart of proposed speech error correction model.
  • Figure 2: Training of Confidence Module.
  • Figure 3: Flowchart of the N-best Fusion Module.
  • Figure 4: An example of the error correction process of the ASR hypothesis.