Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, Kei Sawada
TL;DR
This work presents Nue-ASR, a fully end-to-end ASR framework that connects a pre-trained speech representation model (HuBERT) to a decoder-only large language model (GPT-NeoX) through a trainable bridge network. The model generates text token by token, conditioned on speech prompts; it is trained with a causal LM objective and optional CTC supervision, and supports parameter-efficient fine-tuning via LoRA. On Japanese datasets it reaches competitive character error rates (CERs), with strong in-domain performance and notable gains from domain adaptation. Ablations show that fine-tuning the GPT component and CTC-based compression in the bridge are important for robustness. The approach shows how state-of-the-art pre-trained speech and language models can be combined in a unified, adaptable, and potentially multilingual ASR framework, with implications for end-to-end multimodal tasks.
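To make the idea of CTC-based compression concrete, the following is a minimal sketch of one common variant: consecutive frames that share the same CTC label are averaged into a single vector, and runs labeled as blank are dropped. The summary above does not specify the exact rule used in Nue-ASR, so the function name `ctc_compress`, the blank index `BLANK = 0`, and the averaging rule are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical)


def ctc_compress(frames, ctc_labels, blank=BLANK):
    """Average consecutive frames sharing a CTC label and drop blank runs.

    frames: list of equal-length feature vectors (lists of floats)
    ctc_labels: per-frame argmax CTC label, same length as frames
    """
    if len(frames) != len(ctc_labels):
        raise ValueError("frames and ctc_labels must have the same length")
    compressed = []
    index = 0
    for label, run in groupby(ctc_labels):
        run_length = len(list(run))
        segment = frames[index:index + run_length]
        index += run_length
        if label == blank:
            continue  # blank runs emit no token, so they are removed
        # element-wise mean over the frames in this run
        mean = [sum(column) / run_length for column in zip(*segment)]
        compressed.append(mean)
    return compressed


# Six toy 2-dimensional frames: two frames of label 7, one blank, three of label 5.
frames = [[1.0, 1.0], [3.0, 5.0], [9.0, 9.0], [2.0, 4.0], [4.0, 8.0], [6.0, 0.0]]
labels = [7, 7, 0, 5, 5, 5]
short = ctc_compress(frames, labels)
```

In this toy case six frames shrink to two vectors, which illustrates why compression of this kind shortens the speech prompt that the language model has to attend over.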
Abstract
Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches that use pre-trained models are gaining attention because they reduce the training data and resources required. However, most such ASR systems use either a pre-trained speech model or a pre-trained language model, not both. This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR. By combining the pre-trained models through a bridge network, the proposed model allows the entire ASR process, including acoustic feature extraction, acoustic modeling, and language modeling, to be optimized jointly. It also makes recent developments in LLM utilization, such as parameter-efficient domain adaptation and inference optimization, applicable to ASR. Experimental results demonstrate that, by using powerful pre-trained models in this integrated approach, the proposed model achieves performance comparable to that of modern E2E ASR models.
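As a rough illustration of training a decoder-only LLM on a speech prompt followed by text, the sketch below builds next-token targets in which only transcript positions are scored, and combines the causal LM loss with an optional, weighted CTC term. It is a sketch under stated assumptions, not the authors' implementation: the sentinel `IGNORE = -100`, the helper names, the `ctc_weight` default, and the toy token ids and probabilities are all hypothetical.

```python
import math

IGNORE = -100  # assumed sentinel for positions excluded from the loss


def build_targets(num_speech_frames, text_token_ids, eos_id):
    """Next-token targets for a [speech prompt | text] input sequence.

    The last speech position predicts the first text token, each text
    position predicts the following token, and the final one predicts EOS.
    Earlier speech positions have no target.
    """
    if num_speech_frames < 1:
        raise ValueError("the speech prompt needs at least one frame")
    return [IGNORE] * (num_speech_frames - 1) + list(text_token_ids) + [eos_id]


def causal_lm_loss(log_probs, targets):
    """Mean negative log-likelihood over positions that have a target."""
    scored = [-row[target] for row, target in zip(log_probs, targets)
              if target != IGNORE]
    return sum(scored) / len(scored)


def total_loss(lm_loss, ctc_loss=None, ctc_weight=0.5):
    """Add the optional CTC term; ctc_weight is a hypothetical value."""
    if ctc_loss is None:
        return lm_loss
    return lm_loss + ctc_weight * ctc_loss


# Toy example: 3 speech frames, transcript tokens [11, 12], EOS id 2.
targets = build_targets(num_speech_frames=3, text_token_ids=[11, 12], eos_id=2)
# One row of token log-probabilities per input position (made-up numbers).
log_probs = [{}, {}, {11: math.log(0.5)}, {12: math.log(0.25)}, {2: math.log(0.5)}]
lm = causal_lm_loss(log_probs, targets)
```

Masking the speech positions keeps the objective a plain next-token loss over the transcript, so the speech features act purely as a prompt for the language model.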
