CTC-Assisted LLM-Based Contextual ASR

Guanrou Yang, Ziyang Ma, Zhifu Gao, Shiliang Zhang, Xie Chen

TL;DR

This work proposes a CTC-Assisted LLM-Based Contextual ASR model with an efficient filtering algorithm, demonstrating significant improvements over the baseline LLM-based ASR model and substantially surpassing other related work.

Abstract

Contextual ASR, or hotword customization, holds substantial practical value. Despite the impressive performance of current end-to-end (E2E) automatic speech recognition (ASR) systems, they often struggle to recognize rare words accurately. Typical E2E contextual ASR models feature complex architectures and decoding mechanisms, are limited in performance, and are susceptible to interference from distractor words. With large language model (LLM)-based ASR models emerging as the new mainstream, we propose a CTC-Assisted LLM-Based Contextual ASR model with an efficient filtering algorithm. By using coarse CTC decoding results to filter potentially relevant hotwords and incorporating them into the LLM prompt input, our model attains WER/B-WER of 1.27%/3.67% and 2.72%/8.02% on the LibriSpeech test-clean and test-other sets, respectively, when targeting rare long-tail words. This is a significant improvement over the baseline LLM-based ASR model and substantially surpasses other related work. More remarkably, with the help of the large language model and the proposed filtering algorithm, our contextual ASR model still performs well with 2000 biasing words.
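
The filtering idea described above (match a coarse CTC hypothesis against the biasing list, then place the surviving hotwords in the LLM prompt) can be illustrated with a minimal sketch. The abstract does not specify the matching rule or the prompt wording, so everything here is an assumption for illustration: the function names `filter_hotwords` and `build_prompt`, the use of `difflib.SequenceMatcher` as the similarity measure, the 0.8 threshold, and the prompt template are all hypothetical and not taken from the paper.

```python
from difflib import SequenceMatcher


def filter_hotwords(ctc_hypothesis, biasing_list, threshold=0.8):
    """Keep biasing words that closely resemble a word in the coarse CTC hypothesis.

    Assumption: word-level character similarity with a fixed threshold.
    The paper's actual filtering rule may differ.
    """
    hyp_words = ctc_hypothesis.lower().split()
    kept = []
    for hotword in biasing_list:
        target = hotword.lower()
        # Best similarity between this hotword and any hypothesis word;
        # default=0.0 covers an empty hypothesis.
        best = max(
            (SequenceMatcher(None, target, word).ratio() for word in hyp_words),
            default=0.0,
        )
        if best >= threshold:
            kept.append(hotword)
    return kept


def build_prompt(hotwords):
    """Build a text prompt for the LLM (hypothetical template)."""
    base = "Transcribe speech to text."
    if not hotwords:
        return base
    return base + " Some hotwords might help: " + " ".join(hotwords) + "."


if __name__ == "__main__":
    # The CTC output misspells a rare name; the filter still retrieves it.
    coarse = "mister rochestr was here"
    biasing = ["Rochester", "zebra"]
    selected = filter_hotwords(coarse, biasing)
    print(selected)
    print(build_prompt(selected))
```

Under these assumptions, only hotwords that resemble something in the coarse hypothesis reach the prompt, which keeps the prompt short even when the biasing list is large (for example, the 2000-word setting mentioned above).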

Paper Structure

This paper contains 12 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: (a) illustrates the proposed Contextual LLM-based ASR model architecture. (b) and (c) explain the generation process of hotwords to be included in the prompt during training and inference phases respectively.