Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction

Junjie An, Jingguang Tian, Tianyi Wang, Yu Gao, Xiaofeng Mou, Yi Xu

TL;DR

The paper tackles named entity correction in end-to-end ASR with RA-STAR, a retrieval-augmented framework that combines a rephrasing language model (RLM) for named entity recognition with an adaptive self-taught reasoning module (A-STAR). The RLM makes NER more robust to ASR noise by relying on sentence-level semantics rather than strict token alignment. Phonetic-level retrieval then maps each detected span to its top-k phonetically closest candidates, which enables corrections in phonetically ambiguous cases. A-STAR uses self-distillation and adaptive chain-of-thought (CoT) to apply deeper reasoning only where it is needed, and improves itself through Direct Preference Optimization. On AISHELL-1 and the Homophone dataset, the method reduces the named entity character error rate (NE-CER) by 17.96% and 34.42%, respectively, relative to a strong baseline. Larger models benefit most from adaptive reasoning, and reasoning cost drops substantially while accuracy is maintained or improved.

Abstract

End-to-end automatic speech recognition (ASR) systems frequently misrecognize domain-specific phrases like named entities, which can cause catastrophic failures in downstream tasks. A new family of named entity correction methods based on large language models (LLMs) has recently emerged. However, these approaches have yet to fully exploit the sophisticated reasoning capabilities inherent to LLMs. To bridge this gap, we propose a novel retrieval-augmented generation framework for correcting named entity errors in ASR. Our approach consists of two key components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; and (2) a novel self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts the depth of its reasoning based on task difficulty. Experiments on the AISHELL-1 and Homophone datasets demonstrate the effectiveness of our method, which achieves relative reductions in the named entity character error rate of 17.96% and 34.42%, respectively, compared to a strong baseline.
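
The candidate retrieval step in component (1) can be illustrated with a minimal sketch. It assumes each entity in the repository is stored with its pronunciation as a sequence of toneless pinyin syllables, and it ranks candidates by Levenshtein distance over those syllables. The repository contents, the names `edit_distance` and `retrieve_top_k`, and the choice of syllables as the phonetic unit are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of phonetic tokens."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def retrieve_top_k(span_phones: Sequence[str],
                   repository: Dict[str, List[str]],
                   k: int = 3) -> List[Tuple[str, int]]:
    """Return the k entities whose pronunciation is closest to the span."""
    scored = [(edit_distance(span_phones, phones), name)
              for name, phones in repository.items()]
    scored.sort()  # by distance first, then by name for a stable order
    return [(name, dist) for dist, name in scored[:k]]


# Hypothetical repository: entity name -> toneless pinyin syllables.
REPOSITORY = {
    "张伟": ["zhang", "wei"],
    "章薇": ["zhang", "wei"],
    "张卫东": ["zhang", "wei", "dong"],
    "李娜": ["li", "na"],
}

if __name__ == "__main__":
    print(retrieve_top_k(["zhang", "wei"], REPOSITORY, k=3))
```

In this toy repository the two homophones tie at distance 0, so phonetic distance alone cannot pick between them. That is the kind of ambiguity the reasoning component is meant to resolve.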

Paper Structure (13 sections, 4 equations, 1 figure, 3 tables, 1 algorithm)

Figures (1)

  • Figure 1: Architecture of RASTAR. The system processes ASR transcripts through three sequential modules: (1) the NER module identifies entities $\mathcal{E}^{X}$ in the ASR transcript; (2) the retrieval module retrieves the top-$k$ candidate entities from a named entity (NE) candidate repository; (3) the correction module employs a self-taught reasoning mechanism to select the most appropriate candidate for entity replacement.
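
The three-module flow in Figure 1 can be sketched as a small pipeline. This is only an illustration of the data flow, not the authors' implementation: the function `correct_transcript`, its callable parameters, and the toy stand-ins below are hypothetical, and in the paper the NER and correction modules are language models rather than the hand-written stubs used here. Replacing an entity by string match is also an assumption of this sketch.

```python
from typing import Callable, List


def correct_transcript(
    transcript: str,
    detect: Callable[[str], List[str]],
    retrieve: Callable[[str, int], List[str]],
    select: Callable[[str, str, List[str]], str],
    k: int = 3,
) -> str:
    """Run NER -> retrieval -> correction over one ASR transcript."""
    corrected = transcript
    for entity in detect(transcript):            # (1) NER module
        candidates = retrieve(entity, k)         # (2) retrieval module
        if not candidates:
            continue  # nothing to choose from; keep the ASR output
        choice = select(transcript, entity, candidates)  # (3) correction module
        # Replace only the first occurrence of the detected entity text.
        corrected = corrected.replace(entity, choice, 1)
    return corrected


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_detect(transcript: str) -> List[str]:
    return ["jon smith"] if "jon smith" in transcript else []


def toy_retrieve(entity: str, k: int) -> List[str]:
    return ["John Smith", "Joan Smyth"][:k]


def toy_select(transcript: str, entity: str, candidates: List[str]) -> str:
    return candidates[0]  # a real selector would reason over the context


if __name__ == "__main__":
    print(correct_transcript("call jon smith tomorrow",
                             toy_detect, toy_retrieve, toy_select))
```

Passing the modules in as callables keeps the sketch independent of any particular model; each stub could be swapped for a model-backed function with the same signature.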