Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition

Keqi Deng, Jinxi Guo, Yingyi Ma, Niko Moritz, Philip C. Woodland, Ozlem Kalinli, Mike Seltzer

TL;DR

This work tackles streaming automatic speech recognition (ASR) with large language models (LLMs) by introducing Transducer-Llama, a framework that embeds a frozen LLM as the non-blank predictor inside a Factorized Transducer (FT) to enable online decoding. Because large LLM vocabularies cause data sparsity and raise training cost, it proposes a vocabulary adaptation strategy that aligns the LLM with a compact ASR vocabulary while the LLM parameters stay frozen during training. A weak-to-strong LM swap is then employed: the model is first trained with a weak LM predictor under the RNN-T loss, the predictor is then replaced with a strong LLM, and MWER-based fine-tuning finally optimizes the integration of the LLM. On LibriSpeech and Multilingual LibriSpeech (MLS), the streaming model achieves a 17% relative WER reduction over a strong FT baseline and a 32% relative WER reduction over an RNN-T baseline. The approach offers a practical path to using powerful LLMs in real-time speech systems while keeping training cost and latency under control.
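The relative WER reduction (WERR) figures quoted here are ratios against the baseline WER rather than absolute differences. As a minimal sketch, the calculation can be written as below; the function name and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction (WERR) in percent.

    WERR = 100 * (baseline_wer - new_wer) / baseline_wer
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values chosen only to illustrate the arithmetic.
print(round(relative_wer_reduction(10.0, 8.3), 1))  # 17.0
```

With these illustrative numbers, an absolute drop of 1.7 WER points from a baseline of 10.0 corresponds to a 17% WERR.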

Abstract

While large language models (LLMs) have been applied to automatic speech recognition (ASR), the task of making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issues and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields only limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss is employed to fine-tune the integration of the LLM predictor with the Transducer-Llama model. Experiments on the LibriSpeech and large-scale Multilingual LibriSpeech (MLS) corpora show that the proposed streaming Transducer-Llama approach gives a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.
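To make the MWER fine-tuning step more concrete, here is a minimal sketch of the widely used N-best approximation of the MWER loss: hypothesis probabilities are renormalized over the N-best list and each hypothesis is weighted by its word-error count relative to the list average. This formulation is an assumption; the abstract does not state which MWER variant the paper uses, and the function names `edit_distance` and `mwer_loss` are illustrative.

```python
import math


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (ref word missing in hyp)
                cur[j - 1] + 1,          # insertion (extra word in hyp)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def mwer_loss(ref, nbest, log_probs):
    """Expected word errors over an N-best list, relative to the list mean.

    Assumed formulation: sum_i P_hat(y_i) * (W(y_i) - W_mean), where P_hat is
    the model probability renormalized over the N-best hypotheses.
    """
    errors = [edit_distance(ref, hyp) for hyp in nbest]
    top = max(log_probs)  # subtract the max for numerical stability
    weights = [math.exp(lp - top) for lp in log_probs]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    mean_errors = sum(errors) / len(errors)
    return sum(p * (e - mean_errors) for p, e in zip(posteriors, errors))


# Toy example with made-up hypotheses and scores.
ref = "the cat sat".split()
nbest = ["the cat sat".split(), "the cat sad".split(), "a cat sat down".split()]
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
print(round(mwer_loss(ref, nbest, log_probs), 3))  # -0.3
```

The loss decreases as probability mass moves toward hypotheses with fewer word errors, which is the property that makes it a sequence-level training objective. In a real system the log-probabilities would come from the transducer and the loss would be differentiated with respect to the trainable parameters.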

Paper Structure

This paper contains 15 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Illustration of the Transducer-Llama framework. $\bm{\oplus}$ denotes addition. LLM parameters are fixed (frozen). The output and embedding layers in red are initialised based on the proposed vocabulary adaptation method.
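The addition in Figure 1 can be illustrated with a sketch of how a factorized transducer might combine scores at one (frame, label-history) position. It assumes a common factorized-transducer formulation in which vocabulary log-probabilities from the acoustic side and from the LM predictor are added, then normalized together with a separate blank score. The exact joint network of Transducer-Llama is not described in this section, so the function and argument names here are hypothetical.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    top = max(xs)
    lse = top + math.log(sum(math.exp(x - top) for x in xs))
    return [x - lse for x in xs]


def ft_output_log_probs(blank_logit, enc_vocab_logits, lm_vocab_logits):
    """Combine blank and non-blank scores for one (t, u) position.

    Assumed formulation (not confirmed by the source):
      vocab score = log_softmax(acoustic logits) + log_softmax(LM logits)
      output      = log_softmax([blank score] + vocab scores)
    Index 0 of the result is the blank symbol.
    """
    acoustic = log_softmax(enc_vocab_logits)
    lm = log_softmax(lm_vocab_logits)  # in Transducer-Llama, a frozen LLM
    vocab_scores = [a + l for a, l in zip(acoustic, lm)]
    return log_softmax([blank_logit] + vocab_scores)


# Toy logits for a 3-token vocabulary; values are made up.
out = ft_output_log_probs(0.5, [2.0, 0.1, -1.0], [1.5, 0.3, -0.5])
print(len(out))
```

Because the LM branch outputs a proper distribution over the vocabulary on its own, it can be pre-trained as a language model and, under this assumption, swapped for a different predictor with the same vocabulary, which is the property the weak-to-strong LM swap relies on.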