A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie

TL;DR

This work addresses robustness gaps in audio-LLMs for speech recognition with a transcription prompt-based framework that pairs an ASR expert, used as a transcription tokenizer, with a decoder-only LLM. Training proceeds in two stages: first the transcription tokenizer is trained, then the speech encoder and adapter are trained while the LLM stays frozen, with a loss that alternates between prompted and non-prompted objectives via a lambda gate. Decoding uses a hybrid AR/NAR approach that combines autoregressive and non-autoregressive strategies, switching modes on a repetition-detection threshold to suppress repetition and speed up inference. Evaluations on the 10k-hour WenetSpeech Mandarin corpus show relative CER reductions of 12.2% on Test_Net and 9.6% on Test_Meeting over the baseline, with zero sentence-level repetition and demonstrated robustness to domain shift (AISHELL-1). Overall, the method offers a practical path to robust, efficient speech recognition in audio-LLMs by leveraging transcription prompts and hybrid decoding.
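The repetition check that gates the mode switch can be pictured with a small sketch. This summary does not give the paper's actual detection criterion, so everything below is an assumption made for illustration: the function names `max_consecutive_repeats` and `is_repetitive`, the n-gram window, and the default threshold of 3 are all hypothetical. The sketch only flags a hypothesis as repetitive; which decoding mode is used before or after the switch is not specified here.

```python
def max_consecutive_repeats(tokens, max_ngram=4):
    """Largest number of back-to-back occurrences of any n-gram with n <= max_ngram.

    Hypothetical helper: the paper's real detection rule is not given in this summary.
    """
    if not tokens:
        return 0
    best = 1
    for n in range(1, max_ngram + 1):
        for start in range(len(tokens) - n + 1):
            unit = tokens[start:start + n]
            count = 1
            pos = start + n
            # Count how many times the same n-gram follows itself immediately.
            while tokens[pos:pos + n] == unit:
                count += 1
                pos += n
            best = max(best, count)
    return best


def is_repetitive(tokens, threshold=3):
    """Flag a hypothesis whose longest run of repeated n-grams reaches the threshold.

    The threshold value is an illustrative assumption, not a number from the paper.
    """
    return max_consecutive_repeats(tokens) >= threshold


# Toy character-level hypotheses (made up for this example).
looping = list("abcabcabc")
normal = list("hello")
print(is_repetitive(looping), is_repetitive(normal))
```

In a decoder, a check like this would run on the partial or complete hypothesis, and a positive result would trigger the switch to the other decoding mode.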

Abstract

An audio-LLM adds an audio modality to a large language model (LLM) so that a powerful LLM can recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed hallucination and repetition issues in audio-LLMs, which lead to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM that introduces an ASR expert as a transcription tokenizer, together with a hybrid autoregressive (AR) and non-autoregressive (NAR) decoding approach, to solve these problems. Experiments on the 10k-hour WenetSpeech Mandarin corpus show that our approach reduces CER by a relative 12.2% and 9.6% on the Test_Net and Test_Meeting evaluation sets compared with the baseline. Notably, we reduce the decoding repetition rate on the evaluation set to zero, showing that the decoding repetition problem has been fundamentally solved.
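For readers unfamiliar with the metric, the following is a minimal sketch of how character error rate (CER) and a relative reduction are conventionally computed. It is not the authors' evaluation script; the function names and the toy strings and numbers are made up for illustration and are not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


# Toy example: a repeated hypothesis adds two insertion errors.
print(cer("你好", "你好你好"))
# Toy error rates, not figures from the paper.
print(relative_reduction(0.10, 0.08))
```

The first toy call shows why decoding repetition inflates CER: every repeated character counts as an insertion against the reference.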

Paper Structure

This paper contains 13 sections, 5 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: The overview of our audio-LLM architecture.
  • Figure 2: NAR decoding approach combined with transcription prompt.