Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, Kei Sawada

TL;DR

This work presents Nue-ASR, a fully end-to-end ASR framework that unifies a pre-trained speech representation model (HuBERT) with a decoder-only large language model (GPT-NeoX) through a trainable bridge. The system performs next-token generation conditioned on speech prompts, trained with a causal LM objective and optional CTC supervision, and supports parameter-efficient fine-tuning via LoRA. Empirical results on Japanese datasets show competitive CERs with strong in-domain performance and notable gains from domain adaptation, while ablations reveal the importance of GPT fine-tuning and CTC-based bridge compression for robustness. The approach highlights the potential to leverage state-of-the-art speech and language pre-trained models in a unified, adaptable, and potentially multilingual ASR framework, with implications for end-to-end multimodal tasks.

Abstract

Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches that use pre-trained models are gaining attention because they reduce the training data and resources required. However, most applications of such approaches to ASR use either a pre-trained speech model or a pre-trained language model, not both. This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR. By combining the pre-trained models through a bridge network, the proposed model enables optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling. It also enables the application of remarkable developments in LLM utilization, such as parameter-efficient domain adaptation and inference optimization. Experimental results demonstrate that the proposed model achieves performance comparable to that of modern E2E ASR models by utilizing powerful pre-trained models with the proposed integrated approach.
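
The bridge network sits between the speech encoder and the LLM, and the TL;DR mentions CTC-based bridge compression. The following is a minimal sketch of how the two variants named in the Figure 2 caption ("CTC remove" and "CTC average") might shorten the frame sequence. It rests on assumptions that are not stated on this page: that "remove" drops frames whose greedy CTC label is blank, that "average" merges each run of consecutive frames sharing the same label, and that the blank index is 0. The function names are hypothetical and frames are plain Python lists rather than tensors.

```python
from typing import List, Sequence

Frame = List[float]
BLANK = 0  # assumed index of the CTC blank label (not given in the source)


def ctc_remove(frames: Sequence[Frame], labels: Sequence[int],
               blank: int = BLANK) -> List[Frame]:
    """Keep only frames whose greedy CTC label is not blank (assumed behavior)."""
    return [list(f) for f, lab in zip(frames, labels) if lab != blank]


def _mean(run: Sequence[Frame]) -> Frame:
    """Element-wise mean of a non-empty run of equally sized frames."""
    return [sum(col) / len(run) for col in zip(*run)]


def ctc_average(frames: Sequence[Frame], labels: Sequence[int]) -> List[Frame]:
    """Average each run of consecutive frames with the same CTC label (assumed behavior)."""
    out: List[Frame] = []
    run: List[Frame] = []
    prev = None
    for f, lab in zip(frames, labels):
        if run and lab != prev:
            out.append(_mean(run))  # close the previous run
            run = []
        run.append(list(f))
        prev = lab
    if run:
        out.append(_mean(run))  # close the final run
    return out


# Toy input: four 2-dimensional frames with greedy CTC labels per frame.
frames = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]
labels = [0, 2, 2, 0]
removed = ctc_remove(frames, labels)    # two non-blank frames remain
averaged = ctc_average(frames, labels)  # three runs: blank, label 2, blank
```

Either variant yields a shorter sequence of speech-prompt vectors for the LLM to condition on. Whether blank runs are kept when averaging is a design choice this sketch leaves open.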

Paper Structure

This paper contains 27 sections, 3 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overview of the proposed model. All modules of the speech encoder, bridge network, and LLM, except the convolutional waveform encoder, are simultaneously optimized in an E2E manner.
  • Figure 2: The details of the bridge network. In (b) CTC remove and (c) CTC average, a dedicated softmax layer is placed on top of the HuBERT encoder as a CTC branch. An additional CTC loss is also introduced.
  • Figure 3: Letter-value plot of Levenshtein distance for the ablation study of the proposed model. The model IDs match those in the ablation study table (table:exp_abl).
  • Figure 4: Letter-value plot of Levenshtein distance for the comparison of the proposed model with the publicly available ASR models. The proposed model employed greedy search, while the comparative models used beam search with a beam size of 5. "DS" denotes DeepSpeed-Inference.
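
Figures 3 and 4 plot Levenshtein distance, and the TL;DR reports character error rates (CERs). As a hedged illustration of how the two relate, the sketch below computes a character-level Levenshtein distance with the standard dynamic-programming recurrence and divides it by the reference length to get a CER. The function names are hypothetical, and any text normalization applied before scoring in the paper is not described on this page and is omitted here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


distance = levenshtein("kitten", "sitting")
rate = cer("abcd", "abxd")
```

Because the distance is normalized by reference length, a CER can exceed 1.0 when the hypothesis is much longer than the reference.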