A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Yangze Li; Xiong Wang; Songjun Cao; Yike Zhang; Long Ma; Lei Xie

TL;DR

This work tackles robustness gaps in audio-LLMs for speech recognition with a transcription prompt-based framework that pairs an ASR expert, used as a transcription tokenizer, with a decoder-only LLM. Training has two stages: the transcription tokenizer is trained first, then the speech encoder and adapter are trained with the LLM frozen, using a loss that alternates between prompted and non-prompted objectives via a lambda gate. Decoding uses a hybrid AR-NAR approach that combines autoregressive and non-autoregressive strategies, with a repetition-detection threshold deciding when to switch modes; this drastically reduces repetition and speeds up inference. Evaluations on the 10k-hour WenetSpeech Mandarin corpus show relative CER reductions of 12.2% on Test_Net and 9.6% on Test_Meeting over the baseline, with zero sentence-level repetition and demonstrated robustness to domain shift (AISHELL-1). Overall, the method offers a practical path to robust, efficient speech recognition in audio-LLMs by combining transcription prompts with hybrid decoding.
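The threshold-based mode switch described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the helper names (`max_consecutive_repeats`, `hybrid_decode`), the callables `ar_step` and `nar_decode`, the n-gram based repetition measure, the default threshold of 3, and the direction of the switch (autoregressive first, falling back to a non-autoregressive pass once repetition is detected).

```python
from typing import Callable, List, Sequence


def max_consecutive_repeats(tokens: Sequence[int], max_ngram: int = 4) -> int:
    """Count how many times the trailing n-gram repeats back to back.

    Checks every n-gram size from 1 to max_ngram and returns the largest
    count found (hypothetical repetition measure, not from the paper).
    """
    best = 1 if tokens else 0
    for n in range(1, max_ngram + 1):
        if len(tokens) < 2 * n:
            break
        tail = tuple(tokens[-n:])
        count = 1
        pos = len(tokens) - 2 * n
        # Walk backwards in steps of n while the same n-gram keeps appearing.
        while pos >= 0 and tuple(tokens[pos:pos + n]) == tail:
            count += 1
            pos -= n
        best = max(best, count)
    return best


def hybrid_decode(
    ar_step: Callable[[List[int]], int],   # hypothetical: next token given the prefix
    nar_decode: Callable[[], List[int]],   # hypothetical: one-shot parallel decode
    eos_id: int,
    repeat_threshold: int = 3,             # assumed value, chosen for illustration
    max_len: int = 200,
) -> List[int]:
    """Decode autoregressively; switch to the NAR result if repetition is detected."""
    tokens: List[int] = []
    for _ in range(max_len):
        tok = ar_step(tokens)
        if tok == eos_id:
            return tokens
        tokens.append(tok)
        if max_consecutive_repeats(tokens) >= repeat_threshold:
            # Assumed fallback direction: discard the looping AR output.
            return nar_decode()
    return tokens
```

With a toy `ar_step` that always emits the same token, the loop reaches the threshold after three tokens and returns whatever `nar_decode` produces; a well-behaved `ar_step` that emits an end-of-sequence token is returned unchanged.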

Abstract

Audio-LLMs introduce the audio modality into a large language model (LLM), enabling a powerful LLM to recognize, understand, and generate audio. However, for speech recognition in noisy environments, we observe hallucination and repetition issues in audio-LLMs, which lead to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM that introduces an ASR expert as a transcription tokenizer, together with a hybrid autoregressive (AR) and non-autoregressive (NAR) decoding approach, to solve these problems. Experiments on the 10k-hour WenetSpeech Mandarin corpus show that our approach achieves relative CER reductions of 12.2% and 9.6% on the Test_Net and Test_Meeting evaluation sets compared with the baseline. Notably, we reduce the decoding repetition rate on the evaluation set to zero, showing that the decoding repetition problem has been fundamentally solved.
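The figures above are relative, not absolute, reductions in character error rate (CER). As a reminder of how a relative reduction is computed, here is a small sketch. The function name `relative_cer_reduction` and the input values are hypothetical and chosen only so the arithmetic lands on 12.2%; they are not the CERs reported in the paper.

```python
def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction in percent: (baseline - new) / baseline * 100."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer * 100.0


# Hypothetical values for illustration only (not results from the paper):
# a baseline CER of 10.00% and a new CER of 8.78% give an absolute drop of
# 1.22 points, which is a relative reduction of about 12.2%.
example = round(relative_cer_reduction(10.0, 8.78), 1)
```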
